CS 6505
Introduction
Welcome to Computability, Complexity and Algorithms, the introductory Theoretical Computer Science course for the Georgia Tech Masters and PhD programs. In this course, we will ask the big questions: What is a computer? What are the limits of computation? Are there problems that no computer will ever solve? Are there problems that can't be solved quickly? What kinds of problems can we solve efficiently, and how do we go about developing these algorithms? Understanding the power and limitations of algorithms helps us develop the tools to make real-world computers smarter, faster and safer.
We step away from the particulars of programming languages and computer architectures and instead take an abstract view of
computing. This viewpoint allows us to understand whether a computational problem is inherently easy or hard to solve,
independent of the specific implementation or machine we plan to use. These results have stood the test of time, being as
valuable to us today as they were when they were developed in the 70s.
Though the topic of the course is theory, understanding the material can have very practical benefits. You will learn a wealth of tools and techniques that will help you recognize when problems you encounter in the real world are intractable and when there is an efficient solution. This can save you countless hours that otherwise would have been spent on a fruitless endeavor or in reinventing the wheel.
We hope that you find such direct applications, and certainly, by having followed the rigorous arguments made in this course, you will have given yourself a kind of training. The athlete doesn't just practice on the field or court where the competition is held; he goes to the gym, and when he returns he finds he is better prepared for the game. In a similar way, by taking this course, you will improve your thinking. As you make engineering or even business decisions, you will find that you have better reasons to prefer one strategy over another, and will be able to articulate more rigorous arguments in support of your ideas.
We'll start our journey by going back to when we all first learned about functions.
However, this notion of a function is much different from the mathematical definition of a function:
A function is a set of ordered pairs such that the first element of each pair is from a set X (called the domain), the second
element of each pair is from a set Y (called the codomain or range), and each element of the domain is paired with exactly
one element of the range.
The first conception, the one many of us develop in school, is that of a function as an algorithm: a finite number of steps to produce an output from an input. The second conception, the mathematical definition, is described in the abstract language of set theory. For many practical purposes, the two definitions coincide. However, they are quite different: not every function can be described as an algorithm, since not every function can be computed in a finite number of steps on every input. In other words, there are functions that computers cannot compute.
In this lesson, we will focus on the forms of the inputs and output and leave the definition of the machine for later.
The inputs read by the machine must be in the form of strings of characters from some finite set, called the machine's alphabet. For example, the machine's alphabet might be binary (0s and 1s), it might be based on the genetic code (with symbols A, C, G, T), or it might be the ASCII character set.
Any finite sequence of symbols is called a string. Thus, 0110 is a string over the binary alphabet. Strings are the only input data
type that we allow in our model.
Sometimes, we will talk about machines having string outputs just like the inputs, but more often than not, the output will just be binary: an up-or-down decision about some property of the input. You might imagine the machine just turning on one of two lights, one for accept or one for reject, once the machine is finished computing.
With these rules, an important type becomes a collection of strings. Maybe it's the set of strings that some particular machine accepts, or maybe we are trying to design a machine so that it accepts strings in a certain set and no others, or maybe we're asking if it's even possible to design a machine that accepts everything in some particular set and no others. In all these cases, it's a set of strings that we are talking about, so it makes sense to give this type its own name.
We call a set of strings a language. For example, a language could be a list of names, it could be the set of binary strings that represent even numbers (notice that this set is infinite), or it could be the empty set. Any set of strings over an alphabet is a language.
In addition to these standard set operations, we also define an operation for concatenating two languages. The concatenation
of A and B is just all strings you can form by taking a string from A and appending a string from B to it. In our examples, this set
would contain 00, with the first 0 coming from A and the second from B; the string 011, with the 0 coming from A and the 11 coming from B; and so forth. Of course, we can also concatenate a language with itself. Instead of writing AA, we often write A^2. In general, when we want to concatenate a language with itself k times, we write A^k. Note that for k = 0, this is defined as the language containing exactly the empty string.
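Since the definitions above are concrete enough to execute, here is a minimal sketch of language concatenation and powers in Python, using finite sets of strings to stand in for languages (the sets A and B are illustrative, chosen to match the 00 and 011 examples above):

def concat(A, B):
    # AB: every string formed by a string from A followed by a string from B
    return {a + b for a in A for b in B}

def power(A, k):
    # A^k: A concatenated with itself k times; A^0 is {empty string}
    result = {""}
    for _ in range(k):
        result = concat(result, A)
    return result

A = {"0", "01"}
B = {"0", "11"}
print(concat(A, B))  # {'00', '011', '010', '0111'}
print(power(A, 2))   # {'00', '001', '010', '0101'}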
When we want to concatenate any number of strings from a language together to form a new language, we use an operator known as the Kleene star. This can be thought of as the union of all possible powers of the language. When we want to exclude the empty string, we use the plus operator, which insists that at least one string from A be used. Notice the difference in starting indices. So for example, the string 01001010 is in A*. There is a way that I can break it up so that each part is in the language A, and so as a whole the string can be thought of as a concatenation of strings from A. Note that even A* doesn't include infinite sequences of symbols. Each individual string from A must be of finite length, and you are only allowed to concatenate a finite number together.
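As a companion sketch (again with an illustrative finite A; the course's actual example language is not shown in these notes), here is a dynamic program over prefixes that tests membership in A*:

def in_star(s, A):
    # True if s splits into zero or more pieces, each a string in A
    n = len(s)
    reachable = [False] * (n + 1)
    reachable[0] = True  # the empty prefix is in A* (the k = 0 power)
    for i in range(n + 1):
        if reachable[i]:
            for w in A:
                if s.startswith(w, i):
                    reachable[i + len(w)] = True
    return reachable[n]

A = {"0", "01", "010"}
print(in_star("01001010", A))  # True: 010 | 01 | 010
print(in_star("", A))          # True: the empty string is always in A*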
For those who have studied regular expressions, this should seem quite familiar. In fact, one gets the notation of regular expressions by treating individual symbols as languages. For example, 0* is the set of all strings consisting entirely of zeros. Here we are treating the symbol 0 as a language unto itself. We will also commonly refer to Σ*, meaning all possible strings over the alphabet Σ. Here we are treating the individual symbols of the alphabet as strings in the language Σ.
A one-to-one correspondence, by the way, is a function that is one-to-one, meaning that no two elements of the domain get mapped to the same element of the range, and also onto, meaning that every element of the range is mapped to by an element of the domain. For example, here is a one-to-one correspondence between the integers 1 through 6 and the set of permutations of three elements.
Now, in general, the existence of a one-to-one correspondence implies that the two sets have the same size, that is, the same number of elements. And this actually holds even for infinite sets. This is why we say that there are as many even natural numbers as there are natural numbers: because it's easy to establish a one-to-one correspondence between the two sets, f(n) = 2n for example. Some examples of countably infinite sets are:
The set of nonnegative even numbers (with correspondence f(x) = x/2 to the natural numbers)
The set of positive odd numbers (with correspondence f(x) = (x - 1)/2)
The set of all integers (with correspondence f(x) = 2x if x is nonnegative and f(x) = -2x - 1 if x is negative)
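As a quick check of the third correspondence, here is that pairing in Python; it shows the integers 0, -1, 1, -2, 2, -3 mapping to the naturals 0 through 5 with no collisions:

def f(x):
    # the correspondence above: evens for nonnegative x, odds for negative x
    return 2 * x if x >= 0 else -2 * x - 1

print([f(x) for x in [0, -1, 1, -2, 2, -3]])  # [0, 1, 2, 3, 4, 5]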
For our purposes, we want to show that the set of all strings over a finite alphabet is countable. Since computer programs are
always represented as finite strings, this will tell us that the set of computer programs is countable. The proof is relatively
straightforward. Recall that Σ* is the union of the strings of size 0 with those of size 1 with those of size 2, etc. Our strategy will just be to assign the first number to Σ^0, the next numbers to Σ^1, the next to Σ^2, etc. Here is the enumeration for the binary alphabet.
The key is that every string gets a positive integer mapped to it under this scheme. Therefore,
The set of all strings over any alphabet is countable.
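Here is a small sketch of that enumeration in Python (the generator name is illustrative); it yields the strings of length 0, then length 1, and so on, so every binary string appears at some finite position:

from itertools import count, islice, product

def all_strings(alphabet=("0", "1")):
    yield ""  # the single string of length 0
    for n in count(1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

print(list(islice(all_strings(), 8)))
# ['', '0', '1', '00', '01', '10', '11', '000']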
Suppose that our set of sets is S0, S1, etc. Without loss of generality, we'll suppose that they are disjoint. (If they happen not to be disjoint, we can always make them so by subtracting out from Sk all the elements it shares with the sets S0, ..., S(k-1).) Then the argument proceeds just as before. We assign the first numbers to S0, the next to S1, etc. Every element in the union must have a first set Sk that it belongs to, and thus it will be counted in the enumeration.
It turns out that we can actually prove something even stronger than this statement here.
We can replace this word finite with the word countable, and say that
A countable union of countable sets is countable.
Notice that our current proof doesn't work. If we tried to count all of the elements of S0 before any of the elements of S1, we might never get to the elements of S1, or any other set besides S0. Nevertheless, this theorem is true. For convenience of notation, we let the elements of Sk be {x_k0, x_k1, ...}, and then we can make each set Sk a row in a grid.
Again, we can't enumerate row-by-row here because we would never finish the first row. On the other hand, we can go diagonal-by-diagonal, since each diagonal is finite. The union of all the sets Sk is the union of all the rows, but that is the same as the union of all the diagonals. Each diagonal being finite, we can then apply the original version of the theorem to prove that a countable union of countable sets is countable.
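A sketch of the diagonal ordering, with (k, i) standing for element x_ki of set S_k; diagonal d holds the finitely many pairs with k + i = d:

def diagonal_order(num_elements):
    order, d = [], 0
    while len(order) < num_elements:
        for k in range(d + 1):
            order.append((k, d - k))  # row k, column d - k lies on diagonal d
        d += 1
    return order[:num_elements]

print(diagonal_order(6))  # [(0, 0), (0, 1), (1, 0), (0, 2), (1, 1), (2, 0)]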
Note that this idea proves that the rationals are countable. Imagine putting all fractions with a 1 in the numerator in the first row,
all those with a 2 in the numerator in the second row, etc.
For the proof, we'll suppose not. That is, suppose there is an enumeration L1, L2, ... of the languages over an alphabet Σ. Also, let x1, x2, ... be the strings in Σ*. We are then going to build a table, where the columns correspond to the strings from Σ* and the rows correspond to the languages. In each entry in the table, we'll put a 1 if the string is in the language and a 0 if it is not.
Now, we are going to consider a very sneaky language defined as follows: it consists of those strings xi for which xi is not in the language Li. In effect, we've taken the diagonal in this table and just reversed it. Since we are assuming that the set of languages is countable, this language must be Lk for some k. But is xk in Lk or not? From the table, the row for Lk would have to be such that in every column the entry is the opposite of what is on the diagonal. But the diagonal entry can't be the opposite of itself.
If xk is in Lk, then according to the definition of Lk, it should not be in Lk. On the other hand, if xk is not in Lk, then it should be in Lk. Another way to think about this argument is to say that this opposite-of-the-diagonal language must be different from every row in the table because it is different from the diagonal element. In any case, we have a contradiction and can conclude that this enumeration of the languages was invalid, since the rest of our reasoning was sound. This argument is known as the diagonalization trick, and we'll see it come up again later, when we discuss Undecidability.
and observe that the set of strings that represent valid Python programs is a subset of Σ*. Each program is a finite string, after all. (The choice of Python is arbitrary; any language or set of languages works.) Since this set is a subset of the countable set Σ*, it is countable. Thus, there are a countable number of Python programs.
On the other hand, consider this fact. For any language L, we can define F_L to be the function that is 1 if x is in L and 0 otherwise. All such functions are distinct, so the set of these functions must be uncountable, just like the set of all languages.
Here is the profound point: since the set of valid Python programs is countable but the set of functions is not, it follows that there must be some functions that we just can't write programs for. In fact, there are uncountably many of them!
So going back to our picture of the high school classroom, we can see that the teacher, perhaps without realizing it, was talking about something much more general than what the student ended up thinking. There are only countably many computer programs that can follow a finite number of steps, as the student was thinking, but there are uncountably many functions that fit the teacher's definition.
Luckily, one of the very first mathematical models of a computer serves us quite well: a model developed by Alan Turing in the 1930s. Decades before we had digital computers, Turing developed a simple model to capture the thinking process of a
mathematician. This model, which we now call the Turing machine, is an extremely simple device and yet completely captures
our notion of computability.
In this lesson we'll define the Turing machine and what it means for the machine to compute and either accept or reject a given input. In future lessons we'll use this model to exhibit specific problems that we cannot solve.
The input to the machine is a tape onto which the input string has been written. Using a read/write head, the machine turns input into output through a series of steps. At each step, a decision is made about whether and what to write to the tape and whether to move the head right or left. This decision is based on exactly two things:
the current symbol under the read/write head, and
something called the machine's state, which also gets updated as the symbol is written.
That's it. The machine stops when it reaches one of two halting states, named accept and reject. Usually, we are interested in which of these two states the machine halts in, though when we want to compute functions from strings to strings, we pay attention to the tape contents instead.
It's a very interesting historical note that in Alan Turing's 1936 paper, in which he first proposed this model, the inspiration does not seem to come from any thought of an electromechanical device, but rather from the experience of doing computations on paper. In section 9, he starts from the idea of a person, whom he calls the computer, working with pen and paper, and then argues that his proposed machine can do what this person does.
Let's follow his logic by considering my computing a very simple number: Alan Turing's age when he wrote the paper.
Then he argues that the fact that the grid is two-dimensional is just a convenience, so he takes away the paper and says that computation can be done on a tape consisting of a one-dimensional sequence of squares. This isn't convenient for me, but it doesn't limit the computation I can do.
Then he points out that there are limits to the width of human perception. Imagine I am reading a very long mathematical paper, where the phrase "hence by theorem [some big number] we have" is used. When I look back, I probably wouldn't be sure at a glance that I had found the theorem number. I would have to check, maybe four digits at a time, crossing off the ones that I had matched so as to not lose my place. Eventually, I will have matched them all and can re-read the theorem.
Since Turing was going for the simplest machine possible, he takes this idea to the extreme and only lets one symbol be read at a time, and limits movement to only one square at a time, trusting to the strategy of making marks on the tape to record my place and my state of mind to accomplish the same things as I would under normal operation with pen and paper. And with those rules, I have become a Turing machine.
So that's the inspiration: not a futuristic vision of the digital age, but probably Alan Turing's own everyday experience of computing with pen and paper.
We have the tape, the read/write head, which is connected to the state transition logic, and a little display that will indicate the halt state, that is, the internal state of the Turing machine when it stops.
Mathematically, a Turing machine consists of:
1. A finite set of states Q. (Everything used to specify a Turing machine is finite. That is important.)
2. An input alphabet of allowed input symbols. (This must NOT include the blank symbol, which we will notate with a square cup most of the time. For some of the quizzes, where we need you to be able to type the character, we will use b. We can't allow the input alphabet to include the blank symbol, or we wouldn't be able to tell where the input string ended.)
3. A tape alphabet of symbols that the read/write head can use (this WILL include the blank symbol)
4. It also includes a transition function from a (state, tape symbol) pair to a (state, tape symbol, direction) triple. This, of course, tells the machine what to do. For every possible current state and symbol that could be read, we have the appropriate response: the new state to move to, the symbol to write to the tape (make this the same as the read symbol to leave it alone), and the direction to move the head relative to the tape. Note that we can always move the head to the right, but if the head is currently over the first position on the tape, then we can't actually move left. When the transition function says that the machine should move left, we have it stay in the same position by convention.
5. We also have a start state. The machine always starts in the first position on the tape and in this state.
At first, all of this notation may seem overwhelming; it's a seven-tuple after all. Remember, however, that all the machine ever does is respond to the current symbol it sees based on its current state. Thus, it's the transition function that is at the heart of the machine, and almost all of the important information, like the set of states and the tape alphabet, is implicit in it.
One convenient way to represent the transition function, by the way, is with a state diagram, similar to what is often used for finite automata, for those familiar with that model of computation. Each state gets its own vertex in a multigraph, and every row of the transition table is represented as an edge. The edge gets labeled with the remaining information besides the two states, that is, the symbol that is read, the one that is written, and the direction in which the head is moved.
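To make the transition function concrete, here is a minimal single-tape simulator in Python; everything about it (the dict-based transition table, the state names q0, qa, qr) is an illustrative sketch, not the course's notation:

BLANK = "b"

def run_tm(delta, input_string, start="q0", accept="qa", reject="qr", max_steps=1000):
    tape = list(input_string) if input_string else [BLANK]
    state, head = start, 0
    for _ in range(max_steps):
        if state in (accept, reject):
            return state, "".join(tape)
        if head == len(tape):
            tape.append(BLANK)  # the tape is blank beyond the input
        symbol = tape[head]
        state, write, move = delta[(state, symbol)]
        tape[head] = write
        if move == "R":
            head += 1
        elif move == "L" and head > 0:  # at the left end, "L" stays put by convention
            head -= 1
    return "still running", "".join(tape)  # budget exhausted; the machine may loop

# Example: accept exactly the binary strings that contain a 1.
delta = {
    ("q0", "0"): ("q0", "0", "R"),
    ("q0", "1"): ("qa", "1", "R"),
    ("q0", BLANK): ("qr", BLANK, "R"),
}
print(run_tm(delta, "001"))  # ('qa', '001')
print(run_tm(delta, "00"))   # ('qr', '00b')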
See if you can trace through the operation of the Turing machine for the input shown. If you are unsure, watch the video.
Now, it isn't very practical to draw a picture like this one every time we want to refer to the configuration of a Turing machine, so we develop some notation that captures the idea.
We'll do the same computation again, but this time we'll write down the configuration using this notation. We write the start configuration as q0 1011. The part to the left of the state represents the contents of the tape to the left of the head. It's just the empty string in this case. Then we have the state of the machine and then the rest of the tape contents.
After the first step, 1 is to the left of the head, we are still in state q0, and 011 is the string to the right. In the next configuration, 10 is to the left, we are still in state q0, and 11 is to the right, and so on and so forth.
This notation is a little awkward, but it's convenient for typesetting. It's also very much in the spirit of Turing machines, where all structured data must ultimately be represented as strings. If a Turing machine can handle working with data like this, then so can we.
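Formatting a configuration as a string is a one-liner; a sketch (with a space separating the three parts for readability):

def config(tape, state, head):
    # tape contents left of the head, then the state, then the rest
    return tape[:head] + " " + state + " " + tape[head:]

print(config("1011", "q0", 0))  # 'q0 1011', the start configuration
print(config("1011", "q0", 1))  # '1 q0 011', after the first step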
At a slightly higher level, a whole sequence of configurations like this captures everything that a Turing machine did on a particular input, and so we will sometimes call such a sequence a computation. And actually, this representation of a computation will be central when we discuss the Cook-Levin theorem in the section on complexity.
(Watch the video for an illustration of the state transitions in the diagram.)
function from all strings over the input alphabet to strings over the tape alphabet, if you like.
For example, the Turing machine we just described decided the language L consisting of strings of the form w#w, where w is a binary string. We might also say the Turing machine computed the function that is 1 if x is in L and 0 otherwise. Or even just
that the Turing machine computed L.
Contrast this definition with what it takes for a Turing machine to decide a language. Then it needs not only to accept everything in the language, but it must also reject everything else. It can't loop like this Turing machine.
If we wanted to build a decider for this language, we would need to modify the Turing machine so that it detects the end of the string and moves into the reject state.
At this point, it also makes sense to define the language of a machine, which is just the language that the machine recognizes.
After all, every machine recognizes some language, even if it is the empty one.
Formally, we define L(M) to be the set of strings accepted by M , and we call this the language of the machine M .
This statement is called the Church-Turing thesis, named for Alan Turing, whom we met in the previous lesson, and Alonzo Church, who had an alternative model of computation known as the lambda calculus, which turns out to be exactly as powerful as a Turing machine. We call the Church-Turing thesis a thesis because it isn't a statement that we can prove or disprove. In this lecture we'll give a strong argument that our simple Turing machine can do anything today's computers can do, or anything a computer could ever do.
To convince you of the Church-Turing thesis, we'll start from the basic Turing machine and then branch out, showing that it is equivalent to machines as powerful as the most advanced machines today or in any conceivable future. We'll begin by looking at multi-tape Turing machines, which in many cases are much easier to work with. And we'll show that anything a multi-tape Turing machine can do, a regular Turing machine can do too.
Then we'll consider the Random Access Model: a model capturing all the important capabilities of a modern computer, and we will show that it is equivalent to a multitape Turing machine. Therefore it must also be equivalent to a regular Turing machine. This means that a simple Turing machine can compute anything your Intel i7, or whatever chip you may happen to have in your computer, can.
Hopefully, by the end of the lesson, you will have understood all of these connections and you'll be convinced that the Church-Turing thesis really is true. Formally, we can state the thesis as:
a language is computable if and only if it can be implemented on a Turing machine.
This puts the tape head back in the spot it started without affecting the tape's contents. Except for occasionally taking an extra movement step, this Turing machine will operate in the same way as the stay-put machine.
More precisely, we say that two machines are equivalent if they accept the same inputs, reject the same inputs, and loop on the same inputs. Considering the tape to be part of the output, equivalent machines also halt with the same tape contents.
Note that other properties of the machines (such as the number of states, the tape alphabet, or the number of steps in any given computation) do not need to be the same. Just the relationship between the input and the output matters.
Since having multiple tapes makes programming with Turing machines more convenient, and since it provides a nice intermediate step for getting to more complicated models, we'll look at this Turing machine variant in detail. As shown in the figure here, each tape has its own tape head.
What the Turing machine does at each step is determined solely by its current state and the symbols under these heads. At each step, it can change the symbol under each head, and move each head right or left, or just keep it where it is. (With a one-tape machine, we always forced the head to move, but if we required that condition for multitape machines, the differences in tape head positions would always be even, which leads to awkwardness in programming. It's better to allow the heads to stay put.)
Except for those differences, multitape Turing machines are the same as single-tape ones. We'll only need to redefine the transition function. For a Turing machine with k tapes, the new transition function is

δ : Q × Γ^k → Q × Γ^k × {L, R, S}^k
Next, we'll do a little exercise to practice using multitape Turing machines. Again, the point here is not so that you can put experience programming multitape Turing machines on your resume. The idea is to get you familiar with the model so that you
can really convince yourself of the Church-Turing thesis and understand how Turing machines can interpret their own
description in a later lesson.
With that in mind, your task is to build a two-tape TM that decides the language of strings of the form x#y, where x is a substring of y. So, for example, the string 101#01010 is in the language.
The second through fourth characters of y match x. But on the other hand, 001#01010 is not in the language. Even though two 0s and a 1 appear in the string 01010, 001 is not a substring because the matching symbols are not consecutive.
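Before thinking about tapes, it may help to pin down the decision itself; a sketch of the language in plain Python (string containment is exactly contiguous-substring matching):

def in_language(s):
    if s.count("#") != 1:
        return False
    x, y = s.split("#")
    return x in y  # contiguous substring test

print(in_language("101#01010"))  # True
print(in_language("001#01010"))  # False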
On the single tape, we have the contents of the multiple tapes, with each tape's contents separated by a hash. Also, note these dots here. We are using a trick that we haven't used before: expanding the size of the alphabet. For every symbol in the tape alphabet of the multitape machine, we have two on the single-tape machine, one that is marked by a dot and one that is unmarked. We use the marked symbols to indicate that a head of the multitape machine would be over this position on the tape.
Simulating a step of the multitape machine with the single-tape version happens in two phases: one for reading and one for writing. First, the single-tape machine simulates the simultaneous reading of the heads of the multitape machine by scanning over the tape and noting which symbols have marks. That completes the first phase, where we read the symbols.
Now we need to update the tape contents and head positions or markers as part of the writing phase. This is done in a second,
leftward pass across the tape. (See video for an example.)
Note that it is possible that one of these strings will need to increase its length when the multitape machine reaches a position it hasn't reached before. In that case, we just right-shift the tape contents to allow room for the new symbol to be written.
Once all that work is done, we return the head back to the beginning to prepare for the next pass.
So, all the information about the configuration of a multitape machine can be captured on a single tape. It shouldn't be too hard to convince yourself that the logic of reading and keeping track of the multiple dotted symbols and taking the right action can be captured as well. In fact, this would be a good exercise for you to do on your own.
Instead of operating with a finite alphabet like a Turing machine, the RAM model operates with non-negative integers, which can be arbitrarily large. It has registers, useful for storing operands for the basic operations, and an infinite storage device analogous to the tape of a regular Turing machine. I'll call this memory for obvious reasons. There are two key differences between this memory and the tape of a regular Turing machine:
1. each position on this device stores a number, and
2. any element can be read with a single instruction, instead of moving a head over the tape to the right spot.
In addition to this storage, the machine also contains the program itself, expressed as a sequence of instructions, and a special register called the program counter, which keeps track of which instruction should be executed next. Every instruction is one of a finite set that closely resembles the instructions of assembly code. For instance, we have the instruction read j, which reads the contents of the jth address in memory and places it in register 0. Register 0, by the way, has a special status and is involved in almost every operation. We also have a write operation, which writes to the jth address in memory. For moving data between the registers, we have load, which writes to R0, and store, which writes from it, as well as add, which increases the number in R0 by the amount in Rj. All of these operations cause the program counter to be incremented by 1 after they are finished.
To jump around the list of instructions, as one needs to do for conditionals, we have a series of jump instructions that change the program counter, sometimes depending on the value in R0.
And finally, of course, we have the halt instruction to end the program. The final value in R0 determines whether the machine accepts or rejects. Note that in our definition here there is no multiplication. We can achieve that through repeated addition.
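Here is a toy interpreter for a RAM along these lines; the instruction set and its spellings are illustrative, not the course's official list:

def run_ram(program, memory, k=4):
    R = [0] * k   # R[0] is the special register
    pc = 1        # 1-based program counter; 0 means halted
    while pc != 0:
        op, arg = program[pc - 1]
        if op == "read":    R[0] = memory.get(arg, 0)  # memory to R0
        elif op == "write": memory[arg] = R[0]         # R0 to memory
        elif op == "load":  R[0] = R[arg]
        elif op == "store": R[arg] = R[0]
        elif op == "add":   R[0] += R[arg]
        elif op == "jump":  pc = arg; continue
        elif op == "jzero":                            # conditional jump on R0
            if R[0] == 0: pc = arg; continue
        elif op == "halt":  pc = 0; continue
        pc += 1
    return R[0], memory

# Doubling the number at address 1 by repeated addition (no multiply instruction):
program = [("read", 1), ("store", 1), ("add", 1), ("write", 2), ("halt", 0)]
print(run_ram(program, {1: 21}))  # (42, {1: 21, 2: 42})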
We won't have much use for the notation surrounding the RAM model, but nevertheless it's good to write things down mathematically, as this sometimes sharpens our understanding. In this spirit, we can say that a Random Access Turing machine consists of:
- a natural number k indicating the number of registers
- and a sequence of instructions.
The configuration of a Random Access machine is defined by
- the counter value, which is 0 for the halting state and indicates the next instruction to be executed otherwise.
- the register values and the values in the memory, which can be expressed as a function.
(Note that only a finite number of the addresses will contain a nonzero value, so this function always has a finite representation. We'll use 1-based indexing, hence the domain for the tape is the natural numbers starting from one.)
The tape symbols are encoded as numbers via a map E from the tape alphabet Γ to {0, ..., |Γ| - 1}. The blank symbol is mapped to zero. Being in state q0 corresponds to having the program counter point to the top line of the program, so the RAM will execute a sequence of tests for what the symbol under the head would be, adjust the values on the tape or memory accordingly, and then jump to the appropriate line for the next state.
Now we argue the other way: that a traditional Turing machine can simulate a RAM. Actually, we'll create a multitape Turing machine that implements a RAM, since that is a little easier to conceptualize. As we've seen, anything that can be done on a multitape Turing machine can be done with a single tape.
We will have one tape per register, and each tape will represent the number stored in the corresponding register.
We also have another tape that is useful for scratch work in some of the instructions that involve constants like add 55.
Then we have two tapes corresponding to the random access device. One is for input and output, and the other is for
simulating the contents of the memory device during execution. Storing the contents of the random access device is the more
interesting part. This is done just by concatenating the (index, value) pairs using some standard syntax like parentheses and
commas.
The program of the RAM must be simulated by the state transitions of the Turing machine. This can be accomplished by having
a subroutine or sub-Turing machine for each instruction in the program. The most interesting of these instructions are the ones
involving memory. We simulate those by searching the tape that stores the contents of the RAM for one of these pairs that has
the proper index and then reading or writing the value as appropriate. If no such pair is found, then the value on the memory
device must be zero.
After the work of the instruction is completed, the effect of incrementing the program counter is achieved by transitioning to the state corresponding to the start of the next instruction. That is, unless the instruction was a jump, in which case that transition is effected instead. Once the halt instruction is executed, the contents of the tape simulating the random access device are copied out onto the I/O tape.
Universality - (Udacity)
Let M be a Turing machine with states Q = {q0, ..., q(n-1)} and tape alphabet Γ = {a1, ..., am}. Define i and j so that 2^i is at least the number of states and 2^j is at least the number of tape symbols. Then we can encode a state qk as the concatenation of the symbol q with the string w, where w is the i-bit binary representation of k. For example, if there are 6 states, then we need three bits to encode all the states. The state q3 would be encoded as the string q011. By convention, we make
the initial state q0 ,
the accept state q1 ,
the reject state q2 .
We use an analogous strategy for the symbols, encoding ak as the symbol a followed by the string w, where w is the j-bit binary representation of k.
For example, if there are 10 symbols, then we need four bits to represent them all. If a5 is the star symbol, we would encode that symbol as a0101.
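A sketch of that encoding in Python (the helper name is illustrative); it pads the binary representation of k to the number of bits needed for the whole set:

import math

def encode(prefix, k, count):
    bits = max(1, math.ceil(math.log2(count)))  # bits needed for `count` items
    return prefix + format(k, "0{}b".format(bits))

print(encode("q", 3, 6))   # 'q011'  (6 states need 3 bits)
print(encode("a", 5, 10))  # 'a0101' (10 symbols need 4 bits)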
Let's see an encoding for an example.
This example decides whether the input consists of a number of zeros that is a power of two. To encode the Turing machine as a whole, we really just need to encode its transition function. We'll start by encoding the black edge from the diagram. We are going from state zero, seeing the symbol zero; we go to state three, write symbol 0, and move the head to the right. Remember that the order here is input state, input symbol, then output state, output symbol, and finally direction.
The goal is to simulate M's execution when given the input w, halting in an accept or reject state, or not halting at all, and ultimately outputting the encoding of the output of M on w when M does halt. We'll describe a 3-tape Turing machine that achieves this goal of simulating M on w.
The input comes in on the first tape. First, we'll copy the description of the machine to the second tape and copy the initial state to the third tape. For example, the tape contents might end up like this.
Then we rewind all the heads and begin the second phase.
Here, we search for the appropriate tuple in the description of the machine. The first element has to match the current state
stored on tape three, and the symbol part has to match the encoding on tape 1. If no match is found, then we halt the
simulation and put the universal machine in an accepting or rejecting state according to the current state of the machine being
interpreted. If there is a match, however, then we apply the changes to the first tape and repeat.
In order to decide a language, the Turing machine must not only accept every string in the language but it must also explicitly
reject every string that is not in the language. This machine achieves that by not looping on the blanks.
Now, looking at this, someone might object, "Shouldn't we say recognizable by a Turing machine and decidable by a Turing machine?" Of course, we could, and the statements would still be true. But we don't, the reason being that we strongly believe that if anything can do it, a Turing machine can! That's the Church-Turing thesis.
In an absolute sense, we believe that a language is only recognizable by anything if a Turing machine can recognize it and a
language is only decidable by anything if a Turing machine can decide it, and we use terms that reflect that belief.
Now, other terms are sometimes used instead of recognizable and decidable. Some say that Turing machines compute languages, so to go along with that, they say that a language is computable if there is a Turing machine that computes it. Another equivalent term for decidable is recursive. Mathematicians often prefer this word.
And those who use that term will refer to recognizable languages as recursively enumerable. Some also call these languages Turing-acceptable, semi-decidable, or partially decidable.
We should also make clear the relationship between these two terms. Clearly, if a language is decidable, then it is also
recognizable; the same Turing machine works for both. It feels like it should also be true that if a language is recognizable and
its complement is also recognizable, then the language is decidable. This is true, but there is a potential pitfall here that we need
to make sure to avoid.
Suppose that we are given one machine M1 that recognizes a language L and another, M2, that recognizes its complement. If we were to ask your average programmer to use these machines to decide the language, his first guess might go something like this.
This program will not decide L, however, and I want you to tell me why. Check the best answer.
In every step, we execute both machines one more step than in the previous iteration. Note that it doesn't matter if we save the machines' configurations and start where we left off or start over. The question is whether we get the right answer, not how fast.
The string has to be either in L or in its complement, so one of these machines has to halt after some finite number of steps, and when i hits that value, this program will give the right answer.
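A sketch of this interleaving in Python, assuming a hypothetical step-bounded simulator; run(M, x, steps) is taken to return "accept", "reject", or None if the machine hasn't halted within the step budget:

from itertools import count

def decide(M1, M2, x, run):
    for i in count(1):
        if run(M1, x, steps=i) == "accept":
            return True   # x is in L
        if run(M2, x, steps=i) == "accept":
            return False  # x is in the complement of L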
Overall then, we have the following theorem.
A language L is decidable if and only if L and its complement are both recognizable.
Halting on 34 - (Udacity)
Next, we consider the set of Turing machines that halt on the number 34 written in binary. Indicate whether L and its complement are recognizable.
Every row in the table corresponds to a computation, or the sequence of configurations the machine goes through for the given input. Simulating all of these computations means hitting every entry in this table. Note that we can't just simulate M on the empty string first, or we might just keep going forever, filling out the first row and never getting to the second. This is the same problem that we encountered when trying to show that a countable union of countable sets is countable, or that the set of rational numbers is countable.
And the solution is the same too. We go diagonal by diagonal, first simulating the first computation for one step. Then the
second computation for one step and the first computation for two steps, etc. Eventually every configuration in the table is
reached.
Thus, if we are trying to recognize the language of Turing machine descriptions where the Turing machine accepts something, then a Turing machine in the language must accept some string after a finite number of steps.
This will correspond to some entry in the table, so we eventually reach it and accept.
The diagonalization argument comes up in many contexts and is very useful for generating paradoxes and mathematical contradictions. To show how general the technique is, let's examine it in the context of English adjectives.
Here I've created a table with English adjectives both as the rows and as the columns. Consider the row to be the word itself and the column to be the string representation of the word. For each entry, I've written a 1 if the row adjective applies to the column representation of the word. For instance, "long" is not a long word, so I've written a 0. "Polysyllabic" is a long word, so I've written a 1. "French" is not a French word, it's an English word, so I've written a 0. And so forth.
So far, we haven't run into any problems. Now, let's make the following definition: a heterological word is a word that expresses a property that its representation does not possess. We can add the representation to the table without any problems. It is a long, polysyllabic, non-French word. But when we try to add the meaning to the table, we run into problems. Remember: a heterological word is one that expresses a property that its representation does not possess. "Long" is not a long word, so it is heterological. "Polysyllabic" is a polysyllabic word, so it is not heterological, and "French" is not a French word, so it is heterological.
What about "heterological," however? If we say that it is heterological (causing us to put a 1 here), then it applies to itself, and so it can't be heterological. On the other hand, if we say it is not heterological (causing us to put a zero here), then it doesn't apply to itself, and it is heterological. So there really is no satisfactory answer here. "Heterological" is not well-defined as an adjective.
For English adjectives, we tend to simply tolerate the paradox and politely say that we can't answer that question. Even in mathematics, the polite response was simply to ignore such questions until around the turn of the 20th century, when philosophers began to look for more solid logical foundations for reasoning and for mathematics in particular.
Naively, one might think that a set could be an arbitrary collection. But what about the set of all sets that do not contain
themselves? Is this set a member of itself or not? This paradox, posed by Bertrand Russell, wasn't satisfactorily resolved until the
1920s with the formulation of what we now call Zermelo-Fraenkel set theory.
Or, from mathematical logic, consider the statement, "This statement is false." If this statement is true, then it says that it is false. And if this statement is false, then it says so and should be true. It turns out that falsehood in this sense isn't well-defined mathematically.
At this point, you've probably guessed where this is going for this course. We are going to apply the diagonalization trick to Turing machines.
f(i, j) = 1 if machine Mi accepts the string xj, and 0 otherwise.
For this example, I'll fill out the table in some arbitrary way. The actual values aren't important right now.
Now consider the language L, consisting of string descriptions of machines that do not accept their own descriptions. That is, L = { <M> : M does not accept <M> }.
Again we run into a problem. The row corresponding to ML is supposed to have the opposite values of what is on the diagonal. But what about the diagonal element of this row? What does the machine do when it is given its own description? If it accepts itself, then <ML> is not in the language L, so ML should not have accepted itself. On the other hand, if ML does not accept its string representation, then <ML> is in the language L, so ML should have accepted its string representation!
Thankfully, in computability, the resolution to this paradox isn't as hard to see as in set theory or mathematical logic. We just conclude that the supposed machine ML that recognizes the language L doesn't exist.
Here it is natural to object: "Of course it exists. I just run M on itself, and if it doesn't accept, we accept." The problem is that M on itself might loop, or it just might run for a very long time. There is no way to tell the difference.
The end result, then, is that the language L of string descriptions of machines that do not accept their own descriptions is not
recognizable.
Recall that in order for a language to be decidable, both the language and its complement have to be recognizable. Since L is not recognizable, it is not decidable, and neither is its complement, the language where the machine does accept its own description. We'll call this D_TM, the D standing for diagonal.
Dumaflaches - (Udacity)
If you think back to the diagonalization of Turing machines, you will notice that we hardly referred to the properties of Turing machines at all. In fact, except at the end, we might as well have been talking about a different model of computation, say the dumaflache. Perhaps, unlike Turing machines, dumaflaches halt on every input. Such models exist. A model that allowed one step for each input symbol would satisfy this requirement.
How do we resolve the paradox then? Can't we just build a dumaflache that takes the description of a dumaflache as input and then runs it on itself? It has to halt, so we can reject if it accepts and accept if it rejects, achieving the needed inversion. What's the problem? Take a minute to think about it.
A language A is mapping reducible to a language B if there is a computable function f such that, for every string w, w is in A if and only if f(w) is in B.
We write this relation between languages with the less-than-or-equal-to sign with a little M on the side to indicate that we are
referring to mapping reducibility.
It helps to keep in your mind a picture like this.
On the left, we have the language A, a subset of Σ*, and on the right we have the language B, also a subset of Σ*.
In order for the computable function f to be a reduction, it has to map each string in A to a string in B, and each string outside of A to a string outside of B.
The mapping doesn't have to be one-to-one or onto; it just has to have this property.
This works because, by the definition of a reduction, x is in A if and only if R(x) is in B. And by the definition of a decider, this is true if and only if D accepts R(x). Therefore, the output of D tells me whether x is in A. If I can figure out whether an arbitrary string is in B, then by the properties of the reduction, this also lets me figure out whether a string is in A. We can say that the composition of the reduction with the decider for B is itself a decider for A.
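In code form, the composition is just this (R and D are assumed to be callables; the decider returned is total because D always halts):

def decider_for_A(R, D):
    # x in A  iff  R(x) in B  iff  D accepts R(x)
    return lambda x: D(R(x))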
Thus, the fact that A reduces to B has four important consequences for decidability and recognizability. The easiest to see are
If B is decidable, then A is also decidable. (As we've seen, we can just compose the reduction with the decider for B.)
If B is recognizable, then A is also recognizable. (Same logic as above.)
Our strategy is to reduce the diagonal language to it. In other words, we'll argue that deciding B is at least as hard as deciding the diagonal language. Since we can't decide the diagonal language, we can't decide B either.
Here is one of many possible reductions.
The reduction is a computable function whose input is the description of a machine M, and it's going to build another machine N in this Python-like notation. First, we write down the description of a Turing machine by defining this nested function. Then we return that function. An important point is that the reduction never runs the machine N; it just writes the program for it!
Note here that, in this example, N totally ignores the actual input that is given to it. It just accepts if M accepts its own description; otherwise, it loops or rejects. Hence, N is either going to be a machine that accepts everything or a machine that doesn't accept anything, depending on the behavior of M.
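A sketch of that reduction in the same Python-like spirit; simulate(M, x) is a hypothetical helper that runs machine M on input x, returns True on acceptance, and may loop forever, just like the machine it simulates:

def R(M):
    def N(x):                # N ignores its actual input x entirely
        if simulate(M, M):   # run M on its own description
            return "accept"  # in this case L(N) = Sigma-star
        # otherwise simulate() loops or M rejects, so N accepts nothing:
        # L(N) is the empty set
    return N                 # the reduction only writes N down; it never runs it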
In other words, the language of N will be the empty set in one case and Σ* in the other. A decider for B would be able to tell the difference, and therefore tell us whether M accepted its own description. Therefore, if B had a decider, we would be able to decide the diagonal language, which is impossible. So B cannot be decidable.
The halting problem is undecidable.
We'll do this by reducing from the diagonal language. That is, we'll show the halting problem is at least as hard as the diagonal problem. Here is one of many possible reductions.
The reduction creates a machine N that simply ignores its input and runs M on itself. If M rejects itself, then N loops. Otherwise,
N accepts.
At this point it might seem we've just done a bit of symbol manipulation, but let's step back and realize what we've just seen. We showed that no Turing machine can tell whether or not a computer program will halt or remain in a loop forever. This is a problem that we care about, and we can't solve it on a Turing machine or any other kind of computer. You can't solve the halting problem on your iPhone. You can't solve the halting problem on your desktop, no matter how many cores you have. You can't solve the halting problem in the cloud. Even if someone invents a quantum computer, it won't be able to solve the halting problem. To misquote Nick Selby: If you want to solve the halting problem, you're at Georgia Tech, but you still can't do that!
We run the input machine M on the empty string. If M loops, then so will N. We don't accept one string; we accept none. On the other hand, if M does halt on the empty string, then we make N act like a machine in the language S. The empty string is as good as any, so we'll test to see if N's input x is equal to it and accept or reject accordingly. This works because if M halts on the empty string, then N accepts just one string, the empty one, and so <N> is in S. On the other hand, if M doesn't halt on the empty string, then N won't halt on (and therefore won't accept) anything, and therefore <N> isn't in S.
In the one case, the language of N contains only the empty string. In the other case, the language of N is the empty set. A decider for the language S can tell the difference, and therefore we'd be able to decide whether M halted on the empty string or not. Since this is impossible, a decider for S cannot exist.
0^n 1^n - (Udacity)
Now for a slightly more challenging reduction. Reduce the language of machines that halt on the empty string to the language L of Turing machines that accept strings of the form 0^n 1^n.
1. Membership can only depend on the set of strings accepted by M, not on the machine M itself, like the number of states or something like that.
2. The language can't be trivial, either including or excluding every Turing machine. We'll assume that there is a machine, M1, in the language and another, M2, that is not.
Recall that in all our reductions, we created a machine N that either accepts nothing or else has some other behavior, depending on the behavior of the input machine M.
Similarly, there are two cases for Rice's theorem: either the empty set is in P, and therefore every machine that doesn't accept anything is in the language L, or else the empty set is not in P. Let's look at the case where the empty set is not in P first.
In that case, we reduce from H_TM. The reduction looks like this. N just runs M on the empty input. If M halts, then we define N to act like the machine M1.
Thus, N acts like M1 (a machine in the language L) if M halts on the empty string, and loops otherwise. This is exactly what we want.
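A sketch of this case's reduction, reusing the hypothetical simulate() helper from before, with M1 standing for the machine assumed to be in the language L:

def R(M):
    def N(x):
        simulate(M, "")         # loops forever exactly when M loops on empty input
        return simulate(M1, x)  # if M halted, behave exactly like M1
    return N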
Now for the other case, where the empty set is in P.
In this case, we just replace M1 by M2 in the definition of the reduction, so that N behaves like M2 if M halts on the empty string.
This is fine, but we need to reduce from the complement of H_TM: that is, from the set of descriptions of machines that loop on the empty input. Otherwise, we would end up accepting when we wanted to not accept, and vice versa.
All in all then, we have proved the following theorem.
Slightly more intuitively, we can say the following. Let L be a subset of strings representing Turing machines having two key properties:
1. if M1 and M2 accept the same set of strings, then their descriptions are either both in or both out of the language; this just says that the language only depends on the behavior of the machine, not its implementation.
2. the language can't be trivial; there must be a machine whose description is in the language and a machine whose description is not in the language.
The promise of our study of computability was to better appreciate the difference between these understandings, and I hope you will agree that we have achieved that. We have seen, by a counting argument, how there are many functions that are not computable in any ordinary sense of the word. We made precise what we meant by computation, going all the way back to Turing's inspiration from his own experience with pen and paper to formalize the Turing machine. We have seen how this model can compute anything that any computer today, or envisioned for tomorrow, can. And lastly, we have described a whole family of uncomputable functions through Rice's theorem.
P represents the problems we can solve quickly, like your GPS finding a short route to your destination. NP represents problems
where we can check that a solution is correct, such as solving a Sudoku puzzle. In this lesson we will learn about P, NP, and
their role in helping us understand what we may or may not be able to solve quickly.
Given all this information, there are several types of analysis that you might want to do, some easier and some harder. For instance, if you wanted to run a dating service, you would be in pretty good shape. Say that you wanted to maximize the number of matches that you make and hence the number of happy customers. Or perhaps you just want to know if it's possible to give everyone a date. Well, we have efficient algorithms for finding matchings, and we'll see some in a future lesson. Here, in this example, it's possible to match everyone up, and such a matching is fairly easy to find.
Contrast this with the problem of identifying cliques. By clique, I mean a set of people who are all friends with each other. For instance, here is a clique of size three: every pair of members has an edge between them, and cliques of that size aren't too hard to find.
As we start to look for larger cliques, however, the problem becomes harder and harder to solve.
There is one more class of problems that we'll talk about in this section on complexity, and that is the set of NP-complete problems. These are the hardest problems in NP, and we call them the hardest because any problem in NP can be efficiently transformed into an NP-complete problem. Therefore, if someone were to come up with a polynomial algorithm for even one NP-complete problem, then P would expand out in this diagram, making P and NP into the same class. Finding a polynomial solution for clique would do this, so we say that clique is NP-complete. Since solving problems doesn't seem to be as easy as checking answers to problems, we are pretty sure that NP-complete problems can't be solved in polynomial time, and therefore that P does not equal NP.
To computer science novices, the difference between matching and clique might not seem to be a big deal, and it is surprising that one is so much harder than the other. In fact, the difference between a polynomially solvable problem and an NP-complete one can be very subtle. Being able to tell the difference is an important skill for anyone who will be designing algorithms for the real world.
This explains why your phone can give you directions, but supply chain logistics, just figuring out how things should be routed, is a billion-dollar industry.
Actually, however, we don't even need to change the shortest path problem much to get an NP-complete problem. Instead of asking for the shortest path, we could ask for the longest simple path. We have to say simple so that we don't just go around in cycles forever.
This isn't the only possible pairing of similar P and NP-complete problems, either. I'm going to list some more. If you aren't familiar with these problems yet, don't worry. You will learn about them by the end of the course. Vertex cover in bipartite graphs is polynomial, but vertex cover in general graphs is NP-complete.
A class of optimization problems called Linear Programming is in P, but if we restrict the solutions to integers, then we get an NP-complete problem. Finding an Eulerian cycle in a graph, where you touch each edge once, is polynomial. On the other hand, finding a Hamiltonian cycle, which touches each vertex once, is NP-complete.
And lastly, figuring out whether a Boolean formula with two literals per clause is satisfiable is polynomial, but if there are three literals per clause, then the problem is NP-complete.
Unless you are familiar with some complexity theory, problems in P aren't always easy to tell from those that are NP-complete. Yet in the real world, when you encounter a problem, it is very important to know which sort of problem you are dealing with. If your problem is like one of the problems in P, then you know that there should be an efficient solution, and you can avail yourself of the wisdom of many other scientists who have thought hard about how to efficiently solve these problems. On the other hand, if your problem is like one of the NP-complete problems, then some caution is in order. You can expect to be able to find exact solutions for small enough instances, and you may be able to find a polynomial algorithm that will give an approximate solution that is good enough, but you should not expect to find an exact solution that will scale well for all cases.
Being able to know which situation you are in is one of the main practical benefits of studying complexity.
Let's illustrate this idea with an example. Consider the single-tape machine in the figure above that takes binary input and tests whether the input contains a 1. Let's figure out the running time for strings of length 2. We need to consider all the possible strings of length 2, so we make a table and count the number of steps. The largest number of steps is 3, where we read both zeros and then the blank symbol. Therefore, f(2) = 3.
Note that the big-O notation does not have to create a tight bound. Thus, g = O(n^3) too. Setting c = 1 and N = 3 works for this.
An n^k time machine here is one with running time of order n^k, as we defined these terms a few minutes ago.
Perhaps the most interesting thing about this definition is the choice for any k in the natural numbers. Why is this the right definition? After all, if k is 100, then deciding the language isn't tractable in practice.
The answer is that P doesn't exactly capture what is tractable in practice. It's not clear that any mathematical definition would stand the test of time in this regard, given how often computers change, or be relevant in so many contexts. This choice does have some very nice properties, however.
1. It matches tractability better than one might think. In practice, k is usually low for polynomial algorithms, and there are plenty of interesting problems not known to be in P.
2. The definition is robust to changes in the model. That is to say, P is the same for single-tape machines, multi-tape machines, Random Access machines and so forth. In fact, we pointed out that the running times for each of those models are polynomially related when we introduced them.
3. P has the nice property of closure under the composition of algorithms. If one algorithm calls another algorithm as a subroutine a polynomial number of times, then that algorithm is still polynomial, and the problem it solves is in P. In other words, if we do something efficient a reasonably small number of times, then the overall solution will be efficient. P is exactly the smallest class of problems containing linear time algorithms that is closed under composition.
A natural way to represent the graph as a string is to write out its adjacency matrix in scanline order, as done in the figure above. But this isn't the only way to encode the graph. We might do something rather inefficient.
The scanline encoding for this graph represents the number 170 in binary. We could choose to represent the graph in essentially unary. We might represent the graph as 342 zeros followed by 170 ones. The fact that there are 2^9 = 512 symbols total indicates that it's a 3x3 matrix, and converting 170 back into binary gives us the entries of the adjacency matrix.
This is a very silly encoding, but there is nothing invalid about it.
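To make the contrast concrete, here is a small Python sketch of both encodings. The particular 3x3 matrix is an illustrative stand-in for the graph in the figure; any symmetric 0-1 matrix works the same way.

```python
# Sketch: two ways to encode the same 3x3 adjacency matrix as a string.
matrix = [[0, 1, 0],
          [1, 0, 1],
          [0, 1, 0]]

# Concise scanline encoding: concatenate the rows.
scanline = "".join(str(bit) for row in matrix for bit in row)
print(scanline)                    # "010101010", the number 170 in binary

# Verbose, essentially unary encoding: pad out to 2^9 = 512 symbols.
value = int(scanline, 2)           # 170
verbose = "0" * (2 ** 9 - value) + "1" * value
assert len(verbose) == 512         # the total length reveals a 3x3 matrix
```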
This language, it turns out, is in P, not because it allows the algorithm to exploit any extra information or anything like that, but just because the input is so long. The more sensible, concise encoding isn't known to be in P (and probably isn't, by an overwhelming consensus of complexity theorists). Thus, a change in encoding can affect whether a problem is in P, yet it's ultimately problems that we are interested in, independent of the particulars of the encoding.
We deal with this problem essentially by ignoring unreasonable representations like this one. As long as we consider any reasonable encoding (think about what xml or json would produce from how you would store it in computer memory), the particulars won't change the membership of the language in P, and hence we can talk at least informally about problems being in P or not.
On the other hand, in a nondeterministic computation, we start in a single initial configuration, but it's possible for there to be multiple successor configurations. In effect, the machine is able to explore multiple possibilities at once. This potential splitting continues at every step. Sometimes there might be just one possible successor state, sometimes there might be three or more. For each branch, we have all the same possibilities as for a deterministic machine.
It can reject.
It can loop forever.
It can accept.
If the machine ever accepts in any of these branches, then the whole machine accepts.
The only change we need to make to the 7-tuple of the deterministic Turing machine to make it nondeterministic is to modify the transition function. An element of the range is no longer a single (state, tape-symbol to write, direction to move) tuple, but a set of all such possibilities.

δ : Q × Γ → {S | S ⊆ Q × Γ × {L, R}}
This set of all subsets used in the range here is often called a power set.
The only other change that needs to be made is to when the machine accepts. It accepts if there is any valid sequence of configurations that results in an accepting state. Naturally, then, it rejects only when every branch reaches a reject state. If there is a branch that hasn't rejected yet, then we need to keep computing in case it accepts. Therefore, a nondeterministic machine that never accepts and that loops on at least one branch will loop.
Think of the flow diagram as capturing various modules within the deterministic Turing machine. We start by initializing some number p to 1. Then we increment it and test whether p squared is greater than x. If it is, then trying larger values of p won't help us, and we can reject. If p squared is no larger than x, however, then we test to see if p divides x. If it does, we accept. If not, then we try the next value for p.
Each iteration of this loop requires a number of steps that is polynomial in the number of bits used to represent x. The trouble is that we might end up needing x iterations of this outer loop in order to find the right p or confirm that one doesn't exist. This is what makes the deterministic algorithm slow. Since the value of x is exponential in its input size (remember that it is represented in binary), this deterministic algorithm is exponential.
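Here is a minimal sketch of that deterministic loop in Python, using ordinary integer operations in place of single tape steps; the function name is our own.

```python
# Deterministic divisor search, as in the flow diagram above.
# The loop can run until p*p exceeds x, and since x can be as large as
# 2^n for an n-bit input, the iteration count is exponential in n.

def has_nontrivial_divisor(x: int) -> bool:
    p = 1
    while True:
        p += 1                 # try the next candidate divisor
        if p * p > x:          # larger values of p cannot help; reject
            return False
        if x % p == 0:         # p divides x; accept
            return True

print(has_nontrivial_divisor(15))   # True (3 divides 15)
print(has_nontrivial_divisor(13))   # False (13 is prime)
```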
On the other hand, with nondeterminism we can do much better. We initialize p so that it is represented on its own tape as the number 1 written in binary. Then we nondeterministically modify p. By having two possible transitions for the same state and symbol pair, we can nondeterministically append a bit to p. (The nondeterministic transitions are in orange.)
Next, we check to see if we have made p too large. If we did, then there is no point in continuing, so we reject.
On the other hand, if p is not too big, then we nondeterministically decide either to
append a zero to p,
append a 1 to p, or
stop and test whether p divides x.
If there is some p that divides x, then some branch of the computation will set p accordingly. That branch will accept, and so the whole machine will. On the other hand, if no such p exists, then no branch will accept, and the machine won't either. In fact, the machine will always reject, because every branch of computation will be rejected in one of the two possible places.
This nondeterministic strategy is faster because it only requires log x iterations of this outer loop. The divisor p is set one bit at a time and can't use more bits than x, the number it's supposed to divide.
Thus, while the deterministic algorithm we came up with was exponential in the input length, it was fairly easy to come up with a
nondeterministic one that was polynomial.
And the running time of the machine as a whole is the maximum number of steps used on any branch of the computation. Note that once we have a bound on the length of any accepting configuration sequence, we can avoid looping by just creating a timeout.
NP is the set of languages recognized by an O(n^k) nondeterministic Turing machine for some natural number k.

In other words, it's the set of languages recognized in polynomial time by a nondeterministic machine. NP stands for nondeterministic polynomial time.
Nondeterminism can be a little confusing, but it helps to remember that a string is recognized if it leads to any accepting computation, i.e. any accepting path in this tree. Note that any Turing machine that is a polynomial recognizer for a language can easily be turned into a polynomial decider by adding a timeout, since all accepting computations are bounded in length by a polynomial.
But one branch will choose the correct subset and this will accept.
And that's all we need. If one branch accepts, then the whole nondeterministic machine does, as it should. There is a clique of size 4 here.
In other words, for every string w ∈ L, there is a certificate c that can be paired with it so that V will accept, and for every string not in L, there is no such string c. It's intuitive to think of w as a statement and of c as the proof. If the statement is true, then there should be a proof for it that V can check. On the other hand, if w is false, then no proof should be able to convince the verifier that it is true.
A verifier is polynomial if its running time is bounded by a polynomial in |w|.

Note that this w is the same one as in the definition. It is the string that is a candidate for the language. If we included the certificate c in the bound, then the definition would become meaningless, since we could make c as long as necessary. That's a polynomial verifier.
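As an illustration, here is a hedged sketch of a polynomial verifier for the clique problem mentioned above: the input w is a graph together with a number k, and the certificate c is a proposed set of k vertices. The representation and names are our own choices.

```python
from itertools import combinations

def verify_clique(edges: set, k: int, certificate: list) -> bool:
    # The certificate must name k distinct vertices.
    if len(set(certificate)) != k:
        return False
    # Every pair of certificate vertices must be joined by an edge.
    return all((u, v) in edges or (v, u) in edges
               for u, v in combinations(certificate, 2))

edges = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)}
print(verify_clique(edges, 4, [1, 2, 3, 4]))   # True: a clique of size 4
print(verify_clique(edges, 3, [2, 3, 5]))      # False: (2, 5) is missing
```

The verifier makes O(k^2) edge lookups, so its running time is polynomial in |w|, as the definition requires.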
We claim:
The set of languages that have polynomial time verifiers is the same as NP.
The key to understanding this connection is once again this picture of the tree of computation performed by the
nondeterministic machine.
If a language is in NP, then there is some nondeterministic machine that recognizes it, meaning that for every string in the language there is an accepting computation path. The verifier can't simulate the whole tree of the nondeterministic machine in polynomial time, but it can simulate a single path. It just needs to know which path to simulate. But this is what the certificate can tell it. The certificate can act as directions for which turns to make in order to find the accepting computation of the nondeterministic machine. Hence, if there is a nondeterministic machine that can recognize a language, then there is a verifier that can verify it.
Now, we'll argue the other direction. Suppose that V verifies a language. Then we can build a nondeterministic machine whose computation tree will look a bit like a jellyfish. At the very top, we have a high degree of branching as the machine nondeterministically appends a certificate c to its input. Then it just deterministically simulates the verifier. If there is any certificate that causes V to accept, the nondeterministic machine will find it. If there isn't one, then the nondeterministic machine won't.
Which is in NP - (Udacity)
Now that we've defined NP and defined what it means to be verifiable in polynomial time, I want you to apply this knowledge to decide if several problems are in NP. First, is a graph connected? Second, does a graph have a set of k vertices with no edges between them? This is called the independent set problem. And lastly, will a given Turing machine M accept exactly one string?
NP-Completeness - (Udacity)
Introduction - (Udacity, Youtube)
This lecture covers the theory of NP-completeness, the idea that there are some problems in NP so general and so expressive
that they capture all of the challenges of solving any problem in NP in polynomial time. These problems provide important
insight into the structure of P and NP, and form the basis for the best arguments we have for the intractability of many important
real-world problems.
Clearly, P is contained inside of NP, and we are pretty sure that this containment is strict. That is to say that there are some problems in NP but not in P, where the answer can be verified efficiently but it can't be found efficiently. In this picture, then, you can imagine the harder problems being at the top.
Now, suppose that you encounter some problem where you know how to verify the answer, but where you think that finding an answer is intractable. Unfortunately, your boss or maybe your advisor doesn't agree with you and keeps asking for an efficient solution.
How would you go about showing that the problem is in fact intractable? One idea is to show that the problem is not in P. That would indeed show that it is not tractable, but it would do much more. It would show that P is not equal to NP. You would be famous. As we talked about in the last lecture, whether P is equal to NP is one of the great open questions in mathematics. Let's rule that option out. I don't want to discourage you from trying to prove this theorem necessarily. You just should know what you are getting into.
Another possible solution is to show that if you could solve your problem efficiently, then it would be possible to solve another problem efficiently, one that is generally considered to be hard. If you were working in the 1970s, you might have shown that a polynomial solution to your problem would have yielded a polynomial solution to linear programming. Therefore, your problem must be at least as hard as linear programming. The trouble with this approach is that it was later shown that linear programming actually was polynomially solvable. Hence, the fact that your problem is as hard as linear programming doesn't mean much anymore. The class P swallowed linear programming. Why couldn't it swallow your problem as well? This type of argument isn't worthless, but it's not as convincing as it might be.
It would be much better to reduce your problem to a problem that we knew was one of the hardest in the class NP, so hard that if the class P were to swallow it, P would have to swallow all of NP. In other words, we would have to move the P borderline all the way to the top. Such a problem would have to be NP-complete, meaning that we can reduce every language in NP to it. Remarkable as it may seem, it turns out that there are lots of such languages, satisfiability being the first for which this was proved. In other words, we know that it has to be at the top of this image. Turning back to how to show that your problem is intractable, short of proving that P is not equal to NP, the best we can do is to reduce an NP-complete problem like SAT to your problem. Then your problem would be NP-complete too, and the only way your problem could be polynomially solvable is if everything in NP is.
There are two parts to this argument.
The first is the idea of a reduction. We've seen reductions before in the context of computability. Here the reductions will not only have to be computable, but computable in polynomial time. This idea will occupy the first half of the lesson.
The second half will consider the idea of NP-completeness, and we will go over the famous Cook-Levin theorem, which shows that boolean satisfiability is NP-complete.
w ∈ A ⟺ f(w) ∈ B.
The key difference from before is that we have now required that the function be computable in polynomial time, not just that it be computable. We will also say that f is a polynomial-time reduction of A to B.
Here is the key implication of there being a polynomial reduction of one language to another. Let's suppose that I want to know whether a string x is in the language A, and suppose also that there exists a polynomial-time decider M for the language B.
Then all I need to do is take the machine or program that computes this function (let's call it N), feed my string x into it, and then feed that output into M. The machine M will tell me if f(x) is in B. By the definition of a reduction, this also tells me whether x is in A, which is exactly what I wanted to know. I just had to change my problem into one encoded by the language B, and then I could use B's decider.
Therefore, the composition of M with N is a decider for A by all the same arguments we used in the context of computability.
But is it a polynomial decider?
Just convert the input string to prepare it for the decider for B, and return what the decider for B says.
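In code, the whole argument is one composition. This sketch assumes we are handed the reduction f (the machine N) and B's decider (the machine M) as black boxes; the names are ours.

```python
# A minimal sketch of deciding A by composing the reduction with B's decider.
# If f runs in time p(n), then |f(x)| <= p(n), and if decide_B runs in time
# q(m), the total is at most p(n) + q(p(n)): still a polynomial.

def decide_A(x, f, decide_B):
    y = f(x)               # N: translate the instance of A into one of B
    return decide_B(y)     # M: the answer for f(x) is the answer for x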
These two vertices here do not form an independent set because there is an edge between them.
However, these three vertices do form an independent set because there are no edges between them.
Clearly, each individual vertex forms an independent set, since there isn't another vertex in the set for it to have an edge with, and the more vertices we add, the harder it is to find new ones to add. Finding a maximum independent set, therefore, is the interesting question. Phrased as a decision problem, the question becomes: given a graph G, is there an independent set of size k?
To illustrate, consider this graph here. These three shaded vertices do not form a vertex cover.
On the other hand, the two vertices do form a vertex cover because every edge is incident on one of these two.
Clearly, the set of all vertices is a vertex cover, so the interesting question is how small a vertex cover we can get. Phrased as a decision question, where we are given a graph G, it becomes: is there a vertex cover of size k?
Note that this problem is in NP. It's easy enough to check whether a subset of vertices is of size k and whether it covers all the edges.
Notice that both in these examples shown here and in the exercises, the set of vertices used in the vertex cover was the complement of the vertices used in the independent set. Let's see if we can explain this.
The set S is an independent set if there are no edges within S. By within, I mean that both endpoints are in S.
That's equivalent to saying that every edge is not within S, or that every edge is incident on V ∖ S. But that just says that V ∖ S is a vertex cover! The complement of an independent set is always a vertex cover, and vice versa.
As a corollary then,
A graph G has an independent set of size at least k if and only if it contains a vertex cover of size at most |V| − k.
The reduction is therefore fantastically simple: given a graph G and a number k, just change k into |V| − k.
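As a sketch, the entire reduction amounts to complementing the number; the graph passes through untouched. The representation chosen here (a vertex count and an edge set) is an arbitrary one.

```python
# Reduce Independent Set to Vertex Cover: G has an independent set of size
# at least k if and only if G has a vertex cover of size at most |V| - k.

def indset_to_vertex_cover(num_vertices, edges, k):
    return edges, num_vertices - k    # same graph, complemented target
```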
Let's take a look at the proof of this theorem. Let M be the program that computes the function that reduces A to B, and let N be the program that computes the function that reduces B to C. To turn an instance of the problem A into an instance of the problem C, we just pass it through M and then pass that result through N.
Recalling our picture of P and NP from the beginning of the lesson, the NP-complete problems were at the top of NP, and we called them the hardest problems in NP. We can't have anything higher that's still in NP, because if it's in NP, then it can be reduced to an NP-complete problem. Also, if any one NP-complete problem were shown to be in P, then P would extend up and swallow all of NP.
It's not immediately obvious that an NP-complete problem even exists, but it turns out that there are lots of them, and in fact they seem to occur more often in practice than problems in the intermediate zone, which are not NP-complete and so far have not been proved to be in P either.
Historically, the first natural problem to be proved to be NP-Complete is called Boolean formula satisfiability or SAT for short.
SAT is NP-Complete
This was shown to be NP-Complete by Stephen Cook in 1971 and independently by Leonid Levin in the Soviet Union around
the same time. The fact that this problem is NP-complete is extremely powerful, because once you have one NP-Complete
problem, you just need to reduce it to other problems in NP to show that they too are NP-complete. Thus, much of the theory of
complexity can be said to rest on this theorem. This is the high point of our study of complexity.
First, consider the operators. The ∨ indicates a logical OR, the ∧ indicates a logical AND, and the bar over top indicates logical NOT.
For one of these formulas, we first need a collection of variables: x, y, z for this example. These variables appear in the formula as literals. A literal can be either a variable or the variable's negation: for example, x or ȳ.
At the next higher level, we have clauses, which are disjunctions of literals. You could also say a logical OR of literals. One clause is what lies inside a pair of parentheses.
Finally, we have the formula as a whole, which is a conjunction of clauses. That is to say, all the clauses get ANDed together.
Thus, this whole formula is in conjunctive normal form. In general, there can be more than two clauses that get ANDed together.
That covers the terms we'll use for the structure of a CNF formula.
As for satisfiability itself, we say that
A boolean formula is satisfiable if there is a truth assignment for the formula, a way of assigning the variables true and false
such that the formula evaluates to true.
The CNF-satisfiability problem is: given a CNF formula, determine if that formula has a satisfying assignment.
Clearly, this problem is in NP, since given a truth assignment, it takes time polynomial in the number of literals to evaluate the formula. Thus, we have accomplished the first part of showing that satisfiability is NP-complete. The other part, showing that every problem in NP is polynomial-time reducible to it, will be considerably more difficult.
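The first part, membership in NP, is easy to believe from a sketch of the certificate check. The clause representation below is an assumed one: each literal is a variable name paired with a negation flag.

```python
# Checking a truth assignment against a CNF formula takes polynomial time.
def evaluate_cnf(clauses, assignment):
    # The formula is true iff every clause has at least one true literal.
    return all(any(assignment[var] != negated for var, negated in clause)
               for clause in clauses)

# (x OR y) AND (NOT x OR z)
clauses = [[("x", False), ("y", False)], [("x", True), ("z", False)]]
print(evaluate_cnf(clauses, {"x": True, "y": False, "z": True}))   # True
```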
Each configuration in the sequence is represented by a row, where we have one column for the state, one for the head position, and then columns for each of the values of the first p(n) squares of the tape. Note that no other squares can be written to, because there just isn't time to move the head that far in only p(n) steps.
Of course, the first row must represent the initial configuration and the last one must be in an accepting state in order for the
overall computation to be accepting.
Note that it's possible that the machine will enter an accepting state before step p(n), but we can just stipulate that when this happens, all the rest of the rows in the table have the same values. This is like having the accept state always transition to itself.
The Cook-Levin theorem then consists of arguing that the existence of an accepting computation is equivalent to being able to satisfy a CNF formula that captures all the rules for filling out this table.
We let Q_{ik} represent whether after step i, the state is q_k. Similarly, for the position column, we define H_{ij} to represent whether the head is over square j after step i. Lastly, for the tape contents, we define S_{ijk} to represent whether after step i, the square j contains the k-th symbol of the tape alphabet.
Note that as we've defined these variables, there are many truth assignments that are simply nonsensical. For instance, every one of these Q variables could be assigned a value of True, but in a given configuration sequence the Turing machine can't be in all states at all times. Similarly, we can't assign them to be all False; the machine has to be in some state at each time step. We have the same problems with the head position variables and the variables for the squares of the tape.
All of this is okay. For now, we just need a way to make sure that any accepting configuration sequence has a corresponding truth assignment, and indeed it must. For any way of filling out the tableau, the corresponding truth assignment is uniquely defined by these meanings. It is the job of the boolean formula to rule out truth assignments that don't correspond to valid accepting computations.
The machine has to be in some state after each step, which we enforce with the clause (Q_{i0} ∨ Q_{i1} ∨ ⋯ ∨ Q_{i(r−1)}) for every step i. Note that r denotes the number of states here. In this context, it is just a constant. (The input to our reduction is the string that we need to transform into a boolean formula, not the Turing machine description.)
The machine also can't be in two states at once, so we need to enforce that constraint as well by saying that for every pair of state variables for a given time step, one of the two has to be false.

(¬Q_{ij} ∨ ¬Q_{ij′}) for all i and all j ≠ j′
Together these sets of clauses enforce that the machine corresponding to a satisfying truth assignment is in exactly one state
after each step.
For the position of the head, we have a similar set of clauses. The head has to be over some square on the tape, but it can't be in two places.

(H_{i0} ∨ H_{i1} ∨ ⋯ ∨ H_{ip(n)}) for all i
(¬H_{ij} ∨ ¬H_{ij′}) for all i and all j ≠ j′
Lastly, each square on the tape has to have exactly one symbol. Thus, for all steps i and squares j, there has to be some symbol on the square, but there can't be two.

(S_{ij0} ∨ S_{ij1} ∨ ⋯ ∨ S_{ij|Γ|}) for all i and j
(¬S_{ijk} ∨ ¬S_{ijk′}) for all i, j and all k ≠ k′
So far, we have the clauses
The other clauses related to individual configurations come from the start and end states.
The machine must be in the initial configuration at step 0. This means that the initial state must be q0 .
Q_{00}
The head must also be over the first position on the tape.
H_{01}
The first part of the tape must contain the input w. Letting k_1 k_2 ⋯ k_{|w|} be the encoding for the input string, we include the clauses

(S_{01k_1}) ∧ (S_{02k_2}) ∧ ⋯ ∧ (S_{0|w|k_{|w|}})
The rest of the tape must be blank to start.
(S_{0(|w|+1)0}) ∧ (S_{0(|w|+2)0}) ∧ ⋯ ∧ (S_{0p(n)0})

These additional clauses are summarized here.
Suppose that the transition function tells us that if the machine is in state q_3 and it reads a 1, then it can do one of two things:
it can stay in q_3, leave the 1 alone, and move the head to the right, or
it can move to state q_4, write a 0, and move the head to the left.
To translate this rule into clauses for our formula, it's useful to make a few definitions. First, we need to enumerate the tape alphabet so that we can refer to the symbols by number. Next, we define a tuple (k, ℓ, k′, ℓ′, d) to be valid if the triple of the state q_{k′}, the symbol s_{ℓ′}, and the direction d is in the set δ(q_k, s_ℓ).
(¬H_{ij} ∨ ¬Q_{i3} ∨ ¬S_{ij0} ∨
It starts with three literals that ensure the transition rules for being in q_3 and reading the blank symbol actually do apply. If the head isn't in the position we're talking about, the state isn't q_3, or the symbol being read isn't the blank symbol, then the clause is immediately satisfied. The clause can also be satisfied if the machine behaves in any way that is different from what this particular invalid transition would cause to happen.
¬H_{(i+1)(j+1)} ∨ ¬Q_{(i+1)4} ∨ ¬S_{(i+1)j2})
The head could move differently, the state could change differently, or a different symbol could be written.
Another way to parse this clause is as saying: if the (q_3, blank) transition rule applies, then the machine can't have changed in this particular invalid way. Logically, remember that A implies not B is equivalent to not A or not B. That's the logic we are using here.
Now let's state the general rule for creating all the transition clauses. Recall that a tuple (k, ℓ, k′, ℓ′, d) is valid if switching to state q_{k′}, writing out symbol s_{ℓ′} and moving in direction d is an option given that the machine is currently in state q_k and reading symbol s_ℓ. For every step i, position j, and invalid tuple, then, we include in the formula the clause

(¬H_{ij} ∨ ¬Q_{ik} ∨ ¬S_{ijℓ} ∨ ¬H_{(i+1)(j+d)} ∨ ¬Q_{(i+1)k′} ∨ ¬S_{(i+1)jℓ′})

The first part tests whether the truth assignment is such that the transition rule applies, and the next three literals ensure that this invalid transition wasn't followed. This is just the generalization of the example we saw earlier.
That way, any algorithm for deciding satisfiability in polynomial time would be able to decide every language in NP.
Only two questions remain.
Is the reduction correct?
Is it polynomial in the size of the input?
Let's consider correctness first. If x is in the language, then clearly the output formula f is satisfiable. We can just use the truth assignment that corresponds to an accepting computation of the nondeterministic Turing machine that accepts x. That will satisfy the formula f. That much is certainly true.
How about the other direction? Does the formula being satisfiable imply that x is in the language? Take some satisfying assignment for f. As we've argued,
the corresponding tableau is well-defined (only one of the state variables can be true at any time step, etc.),
the tableau starts in the initial configuration,
every transition is valid, and
the configuration sequence ends in an accepting configuration.
That's all that is needed for a nondeterministic Turing machine to accept, so this direction is true as well.
Now, we have to argue that the reduction is polynomial.
First, I claim that the running time is polynomial in the output formula length. There just isn't much calculation to be done besides iterating over the combinations of steps, head positions, states, and tape symbols and printing out the associated terms of the formula.
Second, I claim that the output formula is polynomial in the input length. Let's go back and count.
The clauses pertaining to the states require O(p(n) log n) string length. The number of states of the machine is a constant in this context. The p(n) factor comes from the number of steps of the machine. The log n factor comes from the fact that we have to distinguish the literals from one another, which requires log n bits. In all these calculations, that's where the log factor comes from.
For the head position, we have O(p(n)^3 log n) string length: one factor of p(n) from the number of time steps and two from all pairs of head positions.
There are O(p(n)^2) combinations of steps and squares, so this family of clauses requires O(p(n)^2 log n) length as well.
The other clauses related to individual configurations come from the start and end states. The initial configuration clause is a mere O(p(n) log n). The constraint that the computation be accepting only requires O(log n) string length.
The transition clauses might seem like they would require a high-order polynomial of symbols, but remember that the size of the nondeterministic Turing machine is a constant in this context. Therefore, the fact that we might have to write out clauses for all pairs of states and tape symbols doesn't affect the asymptotic analysis. Only the ranges of the indices i and j depend on the size of the input string, so we end up with O(p(n)^2 log n).
Adding all those together still leaves us with a polynomial string length. So, yes, the reduction is polynomial! And Cook's theorem is proved.
Steve Cook, a professor at the University of Toronto, presented his theorem on May 4, 1971 at the Symposium on the Theory of Computing at the then Stouffer's Inn in Shaker Heights, Ohio. But his paper didn't have an immediate impact, as the Satisfiability problem was mainly of interest to logicians. Luckily, a Berkeley professor, Richard Karp, also took interest and realized he could use Satisfiability as a starting point. If you can reduce Satisfiability to another problem in NP, then that problem must be NP-complete as well. Karp did exactly that, and a year later he published his famous paper on Reducibility among combinatorial problems, showing that 21 well-known combinatorial search problems were also NP-complete, including, for example, the clique problem.
Today tens of thousands of problems have been shown to be NP-complete, though the ones that come up most often in practice tend to be closely related to one of Karp's originals.
In the next lesson, we'll examine some of the reductions that prove these classic problems to be NP-complete, and try to convey a general sense for how to go about finding such reductions. This sort of argument might come in handy if you need to convince someone that a problem isn't as easy to solve as they might think.
We've shown that we can take an arbitrary problem in NP and reduce it to CNF satisfiability; that is what the Cook-Levin theorem showed.
We also did another polynomial reduction, one from the Independent Set/Clique problem to the Vertex Cover problem. Again, I'm treating Independent Set and Clique here as one, because the problems are so similar. Much of this lesson will be concerned with connecting up these two links into a chain.
First, we are going to reduce general CNF-SAT to 3-SAT, where each clause has exactly three literals. This is a critical reduction because 3-SAT is much easier to reduce to other problems than general CNF. Then we are going to reduce 3-CNF to Independent Set, and by transitivity this will show that Vertex Cover is NP-complete. Note that this is very convenient, because it would have been messy to try to reduce every problem in NP to these problems directly. Finally, we will reduce 3-CNF to the Subset Sum problem, to give you a fuller sense of the types of arguments that go into reductions.
Here is our overall strategy. We're going to take a CNF formula f and turn it into a 3CNF formula f′, which will include some additional variables, which we'll label Y. This formula f′ will have the property that for any truth assignment t for the original formula f, t will satisfy f if and only if there is a way t_Y of assigning values to Y so that t extended by t_Y satisfies f′.
Let's illustrate such a mapping for a simple example. Take this disjunction of 4 literals:

(z_1 ∨ z_2 ∨ z_3 ∨ z_4)
Note that the z_i are literals here, so they could be x_1 or x̄_1, etc. Remember too that disjunction means logical OR; the whole clause is true if any one of the literals is.
We will map this clause of 4 into two clauses of 3 by introducing a new variable y and forming the two clauses

(z_1 ∨ z_2 ∨ ȳ) ∧ (y ∨ z_3 ∨ z_4).
Let's confirm that the desired property holds. If the original clause is true, then we can set y to be z_1 ∨ z_2. If z_1 ∨ z_2 is true, then the first clause will be true by itself, and y can satisfy the other clause. Going the other direction, suppose that we have a truth assignment that satisfies both of these clauses. Suppose first that y is true. That implies that z_1 ∨ z_2 is true, and therefore the clause of four literals is too. Next, suppose that y is false. Then z_3 or z_4 must be true, and again so must the clause of four. If you have understood this, then you have understood the crux of why this reduction works. More details to come.
We can play a similar trick when the original clause has only two literals. We trade the clause (z_1 ∨ z_2) for these two clauses,

(z_1 ∨ z_2 ∨ y_1) ∧ (z_1 ∨ z_2 ∨ ȳ_1),

one with the literal y_1 and one with the literal ȳ_1. If in a truth assignment z_1 ∨ z_2 is satisfied, then clearly both of these clauses have to be true. On the other hand, taking a satisfying assignment for the pair of clauses, the y-literal has to be false in one of them, so z_1 ∨ z_2 must be true.
We can play the same trick when there is just one literal z_1 in the original clause. This time we need to introduce two new variables, which we'll call y_1 and y_2. Then we replace the original clause with four. See the chart below.
Note that if z_1 is true, then all four clauses are true too. On the other hand, for any given truth assignment for z_1, y_1, and y_2 that satisfies these clauses, the y-literals will both be false in one of the four clauses. Therefore, z_1 must be true.
Lastly, we have the case where k > 3. Here we introduce k − 3 variables and use them to break up the clauses according to the pattern shown above.
Let's illustrate the idea with this example:

(z_1 ∨ z_2 ∨ z_3 ∨ z_4 ∨ z_5 ∨ z_6).
Looking at this, we might wish that we could have a single variable that captured whether any one of the first 4 literals were true. Let's call this variable y_3. Then we could express this clause as the 3-literal clause

(y_3 ∨ z_5 ∨ z_6).
At first, I said we wanted y_3 to be equal to z_1 ∨ z_2 ∨ z_3 ∨ z_4, but it's actually sufficient, and considerably less trouble, for y_3 to just imply it. This idea is easy to express as a clause:

(z_1 ∨ z_2 ∨ z_3 ∨ z_4 ∨ ȳ_3)
Either y_3 is false, in which case this implication doesn't apply, or one of z_1 through z_4 had better be true. Together, these two clauses imply the first one. If y_3 is true, then one of z_1 through z_4 is, and so is the original. And if y_3 is false, then z_5 or z_6 is true, and so is the original.
Also, if the original is satisfied, then we can just set y_3 equal to the disjunction of z_1 through z_4, and these two clauses will be satisfied as well.
Note that we went from having a longest clause of length 6 to having one of length 5 and another of length 3. We can apply this strategy again to the clause that is still too long. We introduce another variable y_2 and trade this long clause in for a shorter clause and another one with three literals. Of course, we can then play this trick again, and eventually we'll have just 3-literal clauses.
The example is for k = 6, but an inductive argument will show that the argument works for any k greater than 3.
We've seen how to do this transformation for a single clause, and actually this is enough. We can just transform each clause individually, introducing a new set of variables for each; all of the same arguments about extending or restricting the truth assignment will hold.
Let's illustrate what a transformation of a multi-clause formula looks like with an example. Consider the formula here. That's how CNF can be reduced to 3CNF. This transformation runs in polynomial time, making the reduction polynomial.
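Here is a hedged sketch of the whole clause-by-clause transformation, using the DIMACS-style convention that variable v is the integer v and its negation is -v. For long clauses it uses the standard chain construction, which splits them just as in the discussion above, along with the k = 1 and k = 2 paddings.

```python
from itertools import count

def to_3cnf(clauses, num_vars):
    fresh = count(num_vars + 1)        # generator of brand-new variable ids
    out = []
    for clause in clauses:
        k = len(clause)
        if k == 1:                     # pad with two new variables, four clauses
            z, y1, y2 = clause[0], next(fresh), next(fresh)
            out += [[z, y1, y2], [z, y1, -y2], [z, -y1, y2], [z, -y1, -y2]]
        elif k == 2:                   # pad with one new variable, two clauses
            y = next(fresh)
            out += [[clause[0], clause[1], y], [clause[0], clause[1], -y]]
        elif k == 3:
            out.append(list(clause))
        else:                          # k > 3: chain with k - 3 new variables
            ys = [next(fresh) for _ in range(k - 3)]
            out.append([clause[0], clause[1], ys[0]])
            for i in range(k - 4):
                out.append([-ys[i], clause[i + 2], ys[i + 1]])
            out.append([-ys[-1], clause[-2], clause[-1]])
    return out

print(to_3cnf([[1, 2, 3, 4, 5, 6]], num_vars=6))
# [[1, 2, 7], [-7, 3, 8], [-8, 4, 9], [-9, 5, 6]]
```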
We just reduced the problem of finding a satisfying assignment to a CNF formula to the problem of finding a satisfying assignment to a 3CNF formula. At this point, it's natural to ask: can we go any further? Can we reduce this problem to 2CNF? Well, no, not unless P=NP. There is a polynomial-time algorithm for finding a satisfying assignment to a 2CNF formula, based on finding strongly connected components. Therefore, if we could reduce 3CNF to 2CNF, then P=NP.
So 3CNF is as simple as the satisfiability question gets, and it admits some very clean and easy-to-visualize reductions that allow us to show that other problems are NP-complete. We'll go over two of these: first, the reduction to Independent Set (or Clique) and then to Subset Sum.
At the beginning of the lesson, we promised that we could link up these two chains and use the transitivity of reductions to show that Vertex Cover and Independent Set were NP-complete. We now turn to the last link in this chain and will reduce 3CNF-SAT to Independent Set. As we've already argued that these problems are in NP, this will complete the proofs.
Here then is the transformation, or reduction, we need to accomplish. The input is a 3CNF formula with m clauses, and we need
to output a graph that has an independent set of size m if and only if the input is satisfiable.
We'll illustrate on this example:

(a ∨ b ∨ c) ∧ (b̄ ∨ c ∨ d) ∧ (a ∨ b̄ ∨ c̄)

For each literal in the formula, we create a vertex in a graph. Then we add edges between all vertices that came from the same clause. We'll refer to these as the within-clause edges or simply the clause edges. Then we add edges between all literals that are contradictory. We'll refer to these as the between-clause edges or the contradiction edges.
The implementation of this transformation is simple enough to imagine, and I'll leave it to you to convince yourself that it can be done in polynomial time.
Which Edges Don't Belong - (Udacity)
Here is a question to make sure you understand the reduction just given. Consider the formula below and the associated graph. Indicate which edges would NOT have been output by the transformation just described.
Proof that 3CNF ≤ INDSET - (Udacity, Youtube)
Next, we are going to prove that the transformation just described does in fact reduce 3CNF satisfiability to Independent Set. We'll start by arguing that if f is satisfiable, then the graph G that is output has an independent set of size m, the number of clauses in f. Let t be a satisfying assignment. In our example, let's take the one that makes a true, b false, c false and d false, and we'll set the complements accordingly. Then we choose one literal from each clause in the formula that is true under the truth assignment to form our set S. Thus, in our example I might choose the literals indicated by the circled T's in the figure below.
Clearly, the size of the set is m, the number of clauses. Because the vertices come from distinct clauses, there can't be any within-clause edges, and because the truth assignment t doesn't contradict itself, there can't be any contradiction or between-clause edges either. Therefore, S must be an independent set.
Let's prove the other direction next: if the graph G output by the proposed reduction has an independent set of size m, then f is satisfiable. We start with an independent set of size m in the graph. Here, I have marked an independent set in our example graph. The fact that there can be no within-clause edges in S implies that the literals in S must come from distinct clauses. The fact that there can be no between-clause edges implies that the literals in S are non-contradictory. Therefore, any assignment consistent with the literals of S will satisfy f. Here, the choice of literals implies that a, b̄, and c̄ must all be true, but d can be set to be true or false, and we still have a satisfying assignment.
So that completes the proof that Independent Set is as hard as 3CNF, and that completes the chain

CNF ≤_P 3-CNF ≤_P Independent Set ≤_P Vertex Cover,
showing that both Independent Set and Vertex Cover are NP-complete.
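For completeness, here is a sketch of the 3CNF-to-Independent-Set transformation in code, under the same integer-literal convention as before. Vertex names are (clause, position) pairs, an assumed representation that keeps repeated literals distinct.

```python
from itertools import combinations

def threecnf_to_indset(clauses):
    vertices = [(ci, pos) for ci, clause in enumerate(clauses)
                for pos in range(len(clause))]
    edges = set()
    # Clause edges: connect all vertices that came from the same clause.
    for ci, clause in enumerate(clauses):
        for a, b in combinations(range(len(clause)), 2):
            edges.add(((ci, a), (ci, b)))
    # Contradiction edges: connect every pair of contradictory literals.
    for (ci, a), (cj, b) in combinations(vertices, 2):
        if clauses[ci][a] == -clauses[cj][b]:
            edges.add(((ci, a), (cj, b)))
    k = len(clauses)   # independent set of size k exists iff f is satisfiable
    return vertices, edges, k
```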
Next, we are going to branch out, both in the tree here and in the types of arguments we'll make, by considering the subset sum problem.
Imagine that you wanted to divide assets evenly between two parties: maybe we're picking teams on the playground, trying to reach a divorce settlement, dividing spoils among the victors of war, or something of that nature. Then the question becomes: is there a way to partition a set so that things come out even, each side getting exactly half?
In this case, the total is 18, so we could choose k to be equal to 9 and then ask if there is a way to get a subset to sum to 9.
Indeed, there is. We can choose the two 1s and the 7, and that will give us 9. Choosing 4 and 5 would work too, of course.
That's the subset sum problem then. Note that the problem is in NP, because given a subset, it takes polynomial time to add up the elements and see if the sum equals k. Finding the right subset, however, seems much harder. We don't know of a way to do it that is necessarily faster than just trying all 2^m subsets.
Either we can achieve the sum j using only the numbers a_1 through a_{i−1}, or we can achieve j − a_i using the first i − 1 numbers and then include a_i to get j. For now, I just want you to give the tightest valid bound on the running time for a Random Access Machine.
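For reference, here is a sketch of the table-filling idea just described; the names are ours. On a Random Access Machine it takes O(nk) time for n numbers and target k, which is pseudo-polynomial, since k can be exponential in the length of the input.

```python
# reachable[j] records whether some subset of the numbers seen so far sums to j.
def subset_sum(numbers, k):
    reachable = [True] + [False] * k          # sum 0 uses the empty subset
    for a in numbers:
        for j in range(k, a - 1, -1):         # go down so each number is used once
            reachable[j] = reachable[j] or reachable[j - a]
    return reachable[k]

print(subset_sum([1, 1, 4, 5, 7], 9))         # True: 1 + 1 + 7 = 9
```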
The collection of numbers will consist of two numbers for every variable in the formula: one that we include when t(x_i) is true (we'll notate those with y) and another that we include when it is false (we'll notate those with z).
In the end, we want to include either y_i or z_i for every i, since we have to assign the variable x_i one way or the other. To get a satisfying assignment, we also have to satisfy each clause, so we'll want the numbers y_i and z_i to reflect which clauses are satisfied as well. The number y_1 sets x_1 to true, so we put a 1 in that column to indicate that choosing y_1 means assigning x_1. Since this assigns x_1 to be true, it also means satisfying clauses 1 and 3 in our example.
Choosing z_1 also would assign the variable x_1 a value, but it wouldn't satisfy any clauses. Therefore, that row is the same as for y_1 except in the clause columns, which are all set to zero.
We do the analogous procedure for y_2 and z_2: the literal x_2 appears in clause 1, and the literal x̄_2 appears in clauses 2 and 3. We can do the same for variables x_3 and x_4. These then are the numbers that we want to include in our set A.
It remains to choose our desired total k. For each of the variable columns, the desired total is exactly 1: we assign each variable one way or the other. For the clause columns, however, the total just has to be greater than 0: we need only one literal in the clause to be true in order for the clause to be satisfied. Unfortunately, that doesn't yield a specific k that we need to sum to.
The trick is to add more numbers to the table.
These all have zeros in the places corresponding to the variables, and exactly one 1 in a column for a clause. Each clause j gets two numbers that have a 1 in the j-th column. We'll call them g_j and h_j.
This allows us to set the desired number to be 3 in the clause columns. Given a satisfying assignment, the corresponding choice of y and z numbers will have at least a 1 in every clause column but no more than 3. All the 1s and 2s can be boosted up by including the g and h numbers. Note that if some clause is unsatisfied, then including the g and h numbers isn't enough to bring the total to 3, because there are only two of them.
That's the construction. For each variable, the set of numbers to choose from includes two numbers, y and z, which correspond to the truth setting, and for each clause it includes g and h so that we can boost up the total in the clause column to three where needed.
This construction just involves a few for-loops, so it's easy to see that the construction of the set of numbers is polynomial time.
First, we show that if f is satisfiable, then there is a subset of A summing to k. Let t be a satisfying assignment. Then we include y_i in S if x_i is true under t, and we include z_i otherwise. As for the g and h families of numbers, if there are fewer than three literals of clause j satisfied under t, then we include g_j. If there are fewer than two, then we include h_j as well. In total, the sum of these numbers must be equal to k.
In the other direction, we argue that if there is a subset of A summing to k, then there is a satisfying assignment. Suppose that S is a subset of A summing to k. Then the impossibility of any carrying of digits in the sum implies that exactly one of y_i or z_i must have been included. Therefore, we can define a truth assignment t where t(x_i) is true if y_i is included in S and false otherwise. This must satisfy every clause; otherwise, there would be no way for the total in the clause places to be 3.
Altogether, we've seen that Subset Sum is in NP and that we can reduce 3-CNF SAT, an NP-complete problem, to it, so Subset Sum must be NP-complete.
Of course, there are many others. The point of this lesson, however, is not so that you can produce the needed chain of
reductions for every problem known to be NP-complete. Rather, it is to give you a sense for what these arguments look like and
how you might go about making such an argument for a problem that is of particular interest to you.
In the beginning of the course, we defined the general notion of a language. Then we began to distinguish between decidable languages and undecidable ones. Remember that there were uncountably many undecidable languages but only countably many decidable ones. In comparison, the set of decidable languages should be infinitesimally small, but we'll give the decidable ones this big circle anyway, because they are so interesting to us.
In the section of the course on Complexity, we considered several subclasses of the decidable languages:
- P, which consisted of those languages decidable in polynomial time,
- NP, which consisted of the languages that could be verified in polynomial time (represented with the purple ellipse here, and which includes P),
- and then we distinguished certain problems in NP which were the hardest, and we called these NP-complete. We visualize these as this green band at the outside of NP, since if any one of these were in P, then P would expand and swallow all of NP (or equivalently, NP would collapse to P).
In this section, we are going to focus on the class P, the set of polynomially decidable languages. The overall tone here will feel a little more optimistic. In the previous sections of the course, many of our results were of a negative nature: no, you can't decide that language, or no, that problem isn't tractable unless P=NP. Here, the results will be entirely positive. No longer will we be excluding problems from a good class; we will just be showing that problems are in the good class P: yes, you can do that in polynomial time, or it's even a low-order polynomial and we can solve huge instances in practice.
It certainly would be sad if this class P didn't contain anything interesting. One rather boring language in P, for instance, is the language of lists of numbers no greater than 5. Thankfully, however, P contains some much more interesting languages, and we use the associated algorithms every day to do things like sort email, find a route to a new restaurant, or apply filters to enhance the photographs we take. The study of these algorithms is very rich, often elegant, and ever changing, as computer scientists all over the world are constantly finding new strategies and expanding the set of languages that we know to be in P. There are non-polynomial algorithms too, but since computer scientists are most concerned with practical solutions that scale, they tend to focus on polynomial ones, as will we.
In contrast to the study of computability and complexity, the study of algorithms is not confined to a few unifying themes and does not admit a clear overall structure. Of course, there is some structure, in the sense that many various real-world applications can be viewed as variations on the same abstract problems of the type we will consider here. And there are general patterns of thinking, or design techniques, that are often useful. We will discuss a few of these. And sometimes even abstract problems are intimately related, when on the surface they may not seem similar. We will see this happen too. Yet, despite these connections, in large part problems tend to demand to be solved by their own algorithm involving a particular ingenuity, at least if they are to yield up the best running time.
If the topic is so varied, how then can a student understand algorithms in any general way, or become better at finding
algorithms for his own problems? There are two good, complementary ways.
First, and most important, is practice. The more problems that you try to solve efficiently, the better you will become at it and the more perspective you will gain.
Second is to study the classic and most elegant algorithms in detail. Here, we will do this for a few problems often not covered in undergraduate courses.
Even better is if you can combine the two approaches. Don't just follow along with pen and paper. Pause and see if you can anticipate the next step, and work ahead both on the algorithm and the analysis. Keep that advice in mind throughout our discussion of algorithms.
Well, at first this feels like an ideal case for recursion. Since the subproblems are similar, perhaps we can use the same code and just change the parameters. Starting from the hard problem on the right, we could recurse back to the left. The difficulty is that many of the recursive paths visit the same nodes, meaning that such a strategy would involve wasteful recomputation.
This is sometimes called one of the perils of recursion, and it's often illustrated to beginning programmers with the example of computing the Fibonacci sequence. Each number in the Fibonacci sequence is the sum of the previous two, with the first two numbers both being 1 to get us started. That is,

F_0 = 1, F_1 = 1 and F_n = F_{n−1} + F_{n−2}
The hard problem of computing the n-th number in the sequence depends on solving the slightly easier problems of computing the (n−1)-th and the (n−2)-th elements in the sequence. Computing the (n−1)-th element depends on knowing the answer for n−2 and n−3. Computing the (n−2)-th depends on knowing the answer for n−3 and n−4, and so forth.
Thinking about how the recursion will operate, notice that we'll need to compute F_{n−2} once for F_n and once for F_{n−1}, so there's going to be some repeated computation here, and it's going to get worse the further to the left we go.
How bad does it get? The top-level problem of computing the n-th number will only be called once, and it will call the problem of computing the (n−1)-th number once. Computing the (n−2)-th needs to happen once for each time that the two computations that depend on it are called: once for n−1 and once for n. Similarly, computing the (n−3)-th needs to happen once for every time that the problems that depend on it are called, so it gets called two plus one for a total of three times.
Notice that each number here is the sum of the two numbers to the right, so this is the Fibonacci sequence, and the number of times that the (n−k)-th number is computed will be equal to the k-th Fibonacci number, which is roughly the golden ratio raised to the k-th power.
There are two ways to cope with this problem of repeated computation.
One is to memoize the answers to the subproblems. After we solve a subproblem the first time, we write ourselves a memo with the answer, and before we actually do the work of solving a subproblem, we always check our wall of memos to see if we have the answer already.
Alternatively, we can solve the subproblems in the right order, so that anytime we want to solve one of the problems we are sure
that we have solved its subproblems already.
The latter can always be done, because the dependency relationships among the subproblems must form a directed acyclic graph. If there were a cycle, then we would have circular dependencies, and recursion wouldn't work either. We just find an appropriate ordering of the subproblems, so that all the edges go from left to right, and then we solve the subproblems in left-to-right order. This is the approach we'll take for this lesson. It tends to expose more of the underlying nature of the problem and to create faster implementations than using recursion and memoizing.
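Both cures are easy to sketch for the Fibonacci example (using the lecture's convention F_0 = F_1 = 1); the function names are ours.

```python
from functools import lru_cache

@lru_cache(maxsize=None)          # memoize: each answer is computed once
def fib_memo(n):
    if n < 2:
        return 1
    return fib_memo(n - 1) + fib_memo(n - 2)

def fib_bottom_up(n):
    # Solve subproblems left to right, so dependencies are always ready.
    prev, curr = 1, 1
    for _ in range(n - 1):
        prev, curr = curr, prev + curr
    return curr

assert fib_memo(10) == fib_bottom_up(10) == 89
```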
In general, we say that we are given two sequences X and Y over some alphabet, and we want to find an alignment between them; that is, a subset of the pairs between the sequences so that each letter appears in only one pair and the pairs don't cross, like in the example above.
For completeness, we give the formal definition of an alignment, though for our purposes the intuition given above is sufficient.
An alignment is a subset A of {1, …, m} × {1, …, n} such that each index appears in at most one pair, and the pairs don't cross: for (i, j) and (i′, j′) in A, i < i′ implies j < j′. Each unmatched character costs 1, and each matched pair (i, j) incurs a penalty α(x_i, y_j). In total,

c(A) = n + m − 2|A| + Σ_{(i,j)∈A} α(x_i, y_j).
The problem then is to find an alignment of the sequences that minimizes this cost.
First, suppose that we match the last two characters of the sequences together. Then the cost would be the minimum cost of aligning the prefix through m−1 of X and the prefix through n−1 of Y, plus the cost associated with matching the last two characters together.
Another possibility is that we leave the last character of the X sequence unmatched. Then the cost would be the minimum cost of aligning the prefix through m−1 of X and the prefix through n of Y, plus 1 for leaving X_m unmatched. And the last case, where we leave Y_n unmatched instead, is analogous.
Of course, since c(m, n) is defined to be the minimum cost of aligning the sequences, it must be the minimum of these three possibilities. Notice, however, that there was nothing special about the fact that we were using m and n here, and the same argument applies to all combinations of i, j. The problems are similar. Thus, in general,

c(i, j) = min( c(i−1, j−1) + α(x_i, y_j), c(i−1, j) + 1, c(i, j−1) + 1 ).
Knowing C(3,3), the cost of aligning the full sequences, depends on knowing C(3,2), C(2,2) and C(2,3). Knowing C(2,3) depends on knowing C(2,2), C(1,2), and C(1,3), and indeed in general, to figure out any cost, we need to know the costs to the north, west and northwest in this grid.
These dependencies form a directed acyclic graph, and we can linearize them in a number of ways. We might go in scanline order, left to right, or even along the diagonals. We'll keep it simple and do a scanline order. First, we need to fill in the answers for the base cases, and then it's just a matter of passing over the grid and taking the minimum of the three possibilities outlined earlier. Once we've finished, we can retrace our steps by looking at the west, north, and northwest neighbors and figuring out what step we took, and that will allow us to reconstruct the alignment.
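Here is a sketch of that scanline table fill, with an assumed substitution-cost function alpha (0 for equal characters, 1 otherwise) and a cost of 1 for each unmatched character, matching the cost c(A) above.

```python
def align_cost(X, Y, alpha=lambda a, b: 0 if a == b else 1):
    m, n = len(X), len(Y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i                  # base case: all of X[:i] unmatched
    for j in range(n + 1):
        C[0][j] = j                  # base case: all of Y[:j] unmatched
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            C[i][j] = min(C[i - 1][j - 1] + alpha(X[i - 1], Y[j - 1]),  # match
                          C[i - 1][j] + 1,    # leave X_i unmatched
                          C[i][j - 1] + 1)    # leave Y_j unmatched
    return C[m][n]

print(align_cost("AGCAT", "GAC"))    # minimum alignment cost for this pair
```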
Dynamic programming always relies on the existence of some optimal similar substructure. In this case, it was the minimum cost of aligning prefixes of the sequences that we wanted to align. The key recurrence was that the cost of aligning the first i characters of one sequence with the first j characters of the other was the minimum of the cost of matching the last characters, the cost of leaving the last character of the first sequence unmatched, and the cost of leaving the last character of the second sequence unmatched.
Next, recall that matrix multiplication is associative. Thus, as far as the answer goes, it doesn't matter whether we multiply A and B first and then multiply that product by C, or if we multiply B and C first and then multiply A by that product.

(AB)C = A(BC)

The product will be the same, but the number of operations may not be.
Let's see this with an example. Consider the product of a 50 × 20 matrix A with a 20 × 50 matrix B and a 50 × 1 matrix C, and let's count the number of multiplications using both strategies.
If we multiply A and B first, then we spend 50 · 20 · 50 multiplications computing AB. This matrix will have 50 rows and 50 columns, so its product with C will take 50 · 50 · 1 multiplications, for a total of

50 · 20 · 50 + 50 · 50 · 1 = 52,500.

On the other hand, if we multiply B and C first, this costs 20 · 50 · 1 multiplications. This produces a 20 × 1 matrix, and multiplying it by A costs 50 · 20 · 1, for a total of only

20 · 50 · 1 + 50 · 20 · 1 = 2,000,
a big improvement. So, it's important to be clever about the order we choose for multiplying these matrices.
This suggests that we try all possible binary trees and pick the one that has the smallest computational cost.
Starting from the top level, we would consider all n − 1 ways of partitioning the chain into left and right subtrees, and for each one of these possible partitions, we would figure out the cost of computing the subtrees and then multiplying them together.
To figure out the cost of the subtrees themselves, we would need to consider all possible partitions into left and right subtrees, and so forth.
More precisely, let C(i, j) be the minimum cost of multiplying Ai through A j . For each way of partitioning this range, the cost is
the minimum cost of computing each subtree, plus the cost of combining them, and we just want to take the minimum over all
such partitions.
The base case is the product of just one matrix, which doesn't cost anything.
C(i, i) = 0.
Clearly, we will have to compute the minimum cost of multiplying (ABC) in the left problem. But we are going to have to compute it on the right as well, since we need to consider pulling D off from the (ABCD) chain too. We end up re-doing all of the work involved in figuring out how to best multiply ABC over again. As we go deeper in the tree, things get recomputed more and more times.
Fortunately for us, there are only n choose two subproblems, so we can create a table and do the subproblems in the right
order.
The entries along the diagonal are base cases, which have cost zero: a product of one matrix doesn't cost anything. Our goal is to compute the value in the upper right corner, C(1, n).
Consider which subproblems this depends on. When the split represented by k is k = 1, we are considering the costs C(1, 1) and C(2, n). When k = 2, we consider the subproblems C(1, 2) and C(3, n). In general, every entry depends on all the entries to the left and down in the table.
There are many ways to linearize this directed acyclic graph of dependence, but the most elegant is to fill in the diagonals going
towards the upper right corner. In the code, we have let s indicate which diagonal we are working on. The last step, of course, is
just to return this final cost.
The binary tree that produced this cost can be reconstructed from the k that yielded these minimum values. We just need to
remember the split we used for each entry in the table.
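As a minimal Python sketch (the dims list encoding of the matrix shapes is an assumption about representation, not the lecture's notation):

    def matrix_chain(dims):
        """dims[i-1] x dims[i] is the shape of matrix A_i, for i = 1..n."""
        n = len(dims) - 1
        C = [[0] * (n + 1) for _ in range(n + 1)]      # C[i][j]: min cost of A_i..A_j
        split = [[0] * (n + 1) for _ in range(n + 1)]  # remember the best k for each entry
        for s in range(1, n):                 # s: which diagonal we are working on
            for i in range(1, n - s + 1):
                j = i + s
                best = float('inf')
                for k in range(i, j):         # split into A_i..A_k and A_{k+1}..A_j
                    cost = C[i][k] + C[k+1][j] + dims[i-1] * dims[k] * dims[j]
                    if cost < best:
                        best, split[i][j] = cost, k
                C[i][j] = best
        return C[1][n], split

    # The 50x20, 20x50, 50x1 example from above:
    # matrix_chain([50, 20, 50, 1]) returns (2000, ...)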
The optimal substructure that we exploited was the minimum cost of evaluating the subchains, the key recurrence being that the cost of each partition is the cost of evaluating each part plus the cost of combining them, and of course, we want to take the minimum cost over all such partitions.
For graphs with negative weights, the standard single source solution is the Bellman-Ford algorithm, which takes time O(VE) .
Now we can run these algorithms multiple times, once with each vertex as the source. If we visualize the problem of finding the shortest paths between all pairs as filling out a table, each run of Dijkstra or Bellman-Ford fills out one row of the table. We are running the algorithm V times, so we just add a factor of V to the running times.
For the case where we have negative weights on the edges and the graph is dense, we are looking at a running time of O(V⁴). Using dynamic programming, we're going to be able to do better.
By the way, throughout I'll use the squiggly lines to indicate a path between two vertices and a straight line to indicate a single edge.
Unfortunately, by itself this substructure is not enough.
Sure, we might argue that a shortest path from u to v takes a shortest path from u to a neighbor of v first, but how do we find those shortest paths? The subproblems end up having circular dependencies.
One idea is to include a notion of path length defined by the number of edges used. If we knew all shortest paths that only used k − 1 edges, then by examining the neighbors of v, we could figure out a shortest path to v with k edges. We'll let δ(u, v, k) be the weight of the shortest path from u to v that uses at most k edges. Then the recurrence is that δ(u, v, k) is the minimum of δ(u, v, k − 1) and the minimum over neighbors x of δ(u, x, k − 1) + w(x, v).
This strategy works, and it yields the matrix-multiplication shortest paths algorithm that runs in time O(V³ log V). See CLRS for details.
We are going to take a different approach that will yield a slightly faster algorithm and allow us to remove that log factor.
We're going to recurse on the set of potential intermediate vertices used in the shortest paths.
Without loss of generality, we'll assume that the vertices are 1 through n for convenience of notation, i.e. V = {1, …, n}.
Consider the last step of the algorithm, where we have allowed vertices 1 through n − 1 to be intermediate vertices, and just now we are considering the effect of allowing n to be an intermediate vertex. Clearly, our choices are either keeping the old path, or taking the shortest path from u to n and then from n to v. In fact, this is the choice not just for n, but for any k. To get from u to v using only intermediate vertices 1 through k, we can either use k, or not.
Therefore, we define δ(u, v, k) to be the minimum weight of a path from u to v using only 1 through k as intermediate vertices. Then the recurrence becomes
δ(u, v, k) = min( δ(u, v, k − 1), δ(u, k, k − 1) + δ(k, v, k − 1) )
with the base case
δ(u, v, 0) = w(u, v) if (u, v) ∈ E, and ∞ otherwise.
As you might imagine, we are going to fill out a table. The subproblems have three indices, but thankfully only two will be required for us. We'll create a two-dimensional table d indexed by the source and destination vertices of the path. We initialize it for k = 0, where no intermediate vertices are allowed.
Then we allow the vertices one by one. For each source-destination pair, we update the weight of the shortest path, accounting for the possibility of using k as an intermediate vertex. Note that when i or j is equal to k, the weight won't change, since a vertex can't be useful in a shortest path to itself. Hence we don't need to worry about using an outdated value in the loop.
To extract not just the weights of the shortest paths but also the paths themselves, we keep a predecessor table that contains
the second to last vertex on a path from u to v.
Initially, when all paths are single edges, this is just the other vertex on the incident edge. In the update phase, we either leave the predecessor alone if we aren't changing the path, or we switch it to the predecessor of the k-to-j path if the path via k is preferable.
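Here is a minimal Python sketch of this table fill, with the predecessor bookkeeping just described. The dictionary-of-weights input format is an assumption for illustration.

    def floyd_warshall(w, n):
        """w: dict mapping edge (u, v) to its weight; vertices are 1..n."""
        INF = float('inf')
        d = [[INF] * (n + 1) for _ in range(n + 1)]
        pred = [[None] * (n + 1) for _ in range(n + 1)]
        for u in range(1, n + 1):
            d[u][u] = 0
        for (u, v), wt in w.items():       # k = 0: only single edges allowed
            d[u][v] = wt
            pred[u][v] = u                 # the other vertex on the incident edge
        for k in range(1, n + 1):          # allow vertex k as an intermediate
            for i in range(1, n + 1):
                for j in range(1, n + 1):
                    if d[i][k] + d[k][j] < d[i][j]:
                        d[i][j] = d[i][k] + d[k][j]
                        pred[i][j] = pred[k][j]   # predecessor of the k-to-j path
        return d, pred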
Given these preferences, we would like to be able to infer that this individual prefers the NBA over the NFL as well.
In effect, if there is a path from one vertex to another, we would like to add a direct edge between them. In set theory, this is called the transitive closure of a relation. Given what we know already, there is a fairly simple solution: just give each edge weight 1 and run Floyd-Warshall! The distance will be infinity if there is no path, and it will be the minimum number of edges traversed otherwise.
This is more information than we really need, however. We really just want to know whether there is a path, not how long it is. Hence, in this context we use a slight variant where the entries in the table are booleans, 0 or 1, instead of integers; otherwise, the algorithm is essentially the same.
We initialize the table so that entry i, j is 1 if (a_i, a_j) is in the relation and zero otherwise. Note that I'm letting a_1 through a_n be the set A here. Then we keep on adding potential intermediary elements and updating the table accordingly. We have that i and j are in the relation if either they are in the relation already or they are linked together by some a_k.
Often, we are interested not in the transitive closure of a relation but in the reflexive transitive closure. In this case, we just set
the diagonal elements of the relation to be true from the beginning.
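A sketch of the boolean variant in Python, with the reflexive option handled by presetting the diagonal, might look like this:

    def transitive_closure(rel, n, reflexive=False):
        """rel: set of pairs (i, j) over elements 1..n; returns a boolean table t."""
        t = [[False] * (n + 1) for _ in range(n + 1)]
        for (i, j) in rel:
            t[i][j] = True
        if reflexive:                        # reflexive transitive closure:
            for i in range(1, n + 1):        # set the diagonal from the beginning
                t[i][i] = True
        for k in range(1, n + 1):            # add potential intermediary elements
            for i in range(1, n + 1):
                for j in range(1, n + 1):
                    t[i][j] = t[i][j] or (t[i][k] and t[k][j])
        return t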
The convolution c of two sequences a and b is defined by
c_k = Σ_{i=0}^{k} a_i · b_{k−i}.
We can visualize convolution by reversing b and lining it up with a so that the zeroth element of b is under the k-th element of a. This is the alignment for k = 0.
Then, we multiply all elements that overlap and add up all these products. For k = 0, this is just 2 · 1 = 2. Therefore, c_0 = 2.
For k = 1, we slide the reversed b sequence one unit to the right and perform the analogous calculation: c_1 = 0 · 1 + 2 · 0 = 0.
We continue sliding b along and doing these sums until there is no more overlap left.
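The slide-multiply-add procedure translates directly into code. A minimal Python sketch of the naive strategy:

    def convolve(a, b):
        """Naive convolution: c[k] = sum over i of a[i] * b[k - i]."""
        n, m = len(a), len(b)
        c = [0] * (n + m - 1)
        for k in range(n + m - 1):           # each position of the reversed, slid b
            for i in range(max(0, k - m + 1), min(k + 1, n)):  # overlapping elements
                c[k] += a[i] * b[k - i]
        return c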
Convolution has many applications, but the one that will be most convenient for us to talk about is multiplying polynomials. Given the coefficients of two polynomials, we can find the coefficients of the product just by convolving the two sequences of coefficients.
In fact, we can easily repeat the example we just did, but in the context of polynomial multiplication. Once the sequence b is reversed, multiplying corresponding elements gives all the terms with a given power in the exponent of the variable x. For example, this alignment calculates all the x² terms and yields the coefficient c₂.
How long does this process take? Well, for each element in the longer sequence, we had to do as many multiplications and additions as there are elements in the shorter sequence. Sometimes it was a little fewer around the edges, but on average it was at least half this length. Therefore, we can say that convolving two sequences via the naive strategy outlined here takes Θ(nm) operations, where n and m are the lengths of the two sequences. The Fast Fourier Transform will give us a way to improve upon this.
So far, we've assumed that polynomials are represented by their coefficients. For example, A(x) = 2 + x − x². If you have worked with polynomial interpolation or fitting before, however, you will know that an order-n polynomial is uniquely characterized by its values at any n points. (The order of a polynomial, by the way, is the number of coefficients used to define it, or the degree plus one.) Hence, we might just as well represent a polynomial by its values at a sequence of inputs as by its coefficients. For example, that same polynomial could be represented by saying that A(1) = 2, A(0) = 2, and A(−1) = 0.
Going from the coefficient representation to the value representation can be thought of as matrix multiplication. To calculate A at some value, I take the dot product of the corresponding row of the matrix, consisting of the powers of the argument x, with a column vector consisting of the coefficients of A.
This matrix, where the rows are geometric progressions of the values x_i, is important enough that it gets its own name: it is called a Vandermonde matrix. Its determinant is the product of the differences of all of the values for x:
det(V) = Π_{1 ≤ i < j ≤ n} (x_i − x_j).
As long as these are distinct, the matrix is invertible and we can recover the coecients given the values!
Note that I did have to start with a number of points that was equal to the order of the product here.
The fact that multiplying in the value representation is so much faster suggests that it might make sense to convert to the value
representation, do the multiplication there, and then interpolate back to the coecient representation.
We'll visualize the process like this. First we convert from the coefficient representation to the value representation. Then we multiply the corresponding values to get the values of C. Finally, we interpolate to get back the coefficients of the product.
The most efficient way to do this is via Horner's Rule. You can also think about filling out the Vandermonde matrix and then doing the matrix multiplication. Regardless, for arbitrary points we end up with a quadratic running time.
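For reference, Horner's rule evaluates an order-n polynomial at a single point with one multiplication and one addition per coefficient. A minimal sketch:

    def horner(coeffs, x):
        """Evaluate a_0 + a_1*x + ... + a_d*x^d with d multiplications."""
        result = 0
        for a in reversed(coeffs):   # work from the highest-order coefficient down
            result = result * x + a
        return result

    # Evaluating an order-n polynomial at n arbitrary points this way
    # costs O(n) per point, hence O(n^2) overall.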
Multiplying in the value domain takes order n + m time, since we just multiply values for corresponding inputs x_j. This was fine.
Interpolation involves solving a system of equations with n + m equations and n + m unknowns. By Gaussian elimination this would take O((n + m)³) in the worst case. There is also a method called Lagrange interpolation that allows us to do this in time that is just quadratic.
Is there any hope? Well, yes, there is. All of these running times pertain to an arbitrary set of points, but since we are only interested in the coefficients of C, we get to choose the points! As it turns out, this freedom is very powerful.
As you did the calculation for the exercise, you may have taken advantage of the fact that the input values were arranged in positive-negative pairs. For higher-order polynomials, this advantage becomes greater. All the even terms are the same for x and −x, and the odd terms are just the negatives of each other.
Let's define A_e to be the polynomial whose coefficients are the even coefficients of A,
A_e(x) = a_0 + a_2·x + a_4·x² + ⋯
and define A_o to be the polynomial whose coefficients are the odd coefficients of A,
A_o(x) = a_1 + a_3·x + a_5·x² + ⋯
Then we can write
A(x) = A_e(x²) + x·A_o(x²)
and
A(−x) = A_e(x²) − x·A_o(x²).
We get two points for the price of one!
More formally, let's say that we choose the x_i such that x_{i+N/2} = −x_i for i ∈ {0, …, N/2 − 1}. Then we can compute the values two at a time by computing A_e and A_o at x_i² and using them in these equations.
Overall, we've changed the problem from evaluating a polynomial of order N at N points to evaluating two polynomials of half the order at half the number of points. This is good, but at best we've only reduced the running time by a constant factor. We need to be able to apply this strategy recursively to change the asymptotic running time. A set of points that would allow us to do that would be very special indeed.
In order to be able to do this computation recursively, we need the sequence x to have the following properties:
first, the points should all be distinct (x_j ≠ x_j′ unless j = j′); otherwise, our efforts will be wasted and we won't have enough points to interpolate.
second, they should come in positive-negative pairs: x_{j+N/2} = −x_j.
lastly, we want all of these properties to apply to the squares of these numbers, so that we can use the trick again recursively.
If your polynomials are over some unusual field, then it may make sense to choose an unusual set of values for x. For most
applications, however, the coecients will be integers, or reals, or complex numbers and the choice of x will be the complex
roots of unity.
We define
ω_N = e^{2πi/N}
and we let
x_j = ω_N^j.
Let's visualize these points in the complex plane for N = 8.
All of the points have magnitude one, so they will be arranged around the unit circle, and the angle from the positive real axis will be determined by the exponent. Thus, omega to the j-th power will be j/N of the way around the unit circle.
Let's confirm that all the desired properties hold. Indeed, the points are distinct, as j is always less than N, so there is no wrap-around.
The symmetric property holds because adding N/2 to j corresponds to increasing the angle by half the circle, or equivalently, multiplying by negative one.
The recursive property is the hardest to confirm. Notice, however, that for all of the points that have odd powers in the exponent, squaring these numbers makes the exponent even. Thus, ω³ when squared becomes ω⁶. The point ω⁵ becomes ω¹⁰, which wraps around and becomes ω².
Moreover, each of the even powers is the square of exactly two of the other points. Which points? Just divide the exponent by two. That gives you one. For example, for ω⁴, it's ω². Where is the other point? On the opposite side of the circle, of course: ω⁶. The additional N/2 in the exponent becomes an additional N when the point is squared, meaning that it maps to the same place.
The result of all of this is that when we square these numbers, the odd powers of omega go away. Once we are left with only the even powers, however, it doesn't make sense to express these points in terms of ω₈ any more. We end up with the 4th roots of unity instead of the 8th roots. This same logic applies to any N that is a power of two.
It is worth noting here how few of the properties of complex numbers were necessary for the recursion we needed. In fact, there are number-theoretic algorithms that use modular arithmetic and avoid the difficulties with precision associated with working with complex numbers.
To compute A_e and A_o at the 2nd roots of unity, we apply the same strategy again: first rewriting them in terms of their even and odd coefficients, and recycling as much of the computation as possible.
Each of the two previous problems has been reduced to evaluating an order one polynomial at one point. But this is trivial, as it
only involves the constant term. The upward pass of the recursion then fills in all these intermediate values, eventually giving us
the values of A at the fourth roots of unity.
We'll state this as a recursive algorithm. The base case is where N is equal to one, in which case we just return the single-element sequence. If N > 1, then we call the FFT recursively, once with the even coefficients and once with the odd coefficients. Then we combine the results, taking care of paired values together. Notice the difference in sign on the contribution from the odd powers.
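Here is a minimal recursive sketch in Python. The omega parameter anticipates the wrinkle mentioned below; by default it is the principal N-th root of unity, and N is assumed to be a power of two.

    import cmath

    def fft(a, omega=None):
        """Evaluate the polynomial with coefficients a at the N powers of omega."""
        N = len(a)
        if omega is None:
            omega = cmath.exp(2j * cmath.pi / N)   # principal N-th root of unity
        if N == 1:
            return list(a)                 # base case: just the constant term
        even = fft(a[0::2], omega * omega) # A_e at the (N/2)-th roots of unity
        odd = fft(a[1::2], omega * omega)  # A_o likewise
        values = [0] * N
        w = 1
        for j in range(N // 2):            # take care of paired values together
            values[j] = even[j] + w * odd[j]
            values[j + N // 2] = even[j] - w * odd[j]  # sign difference on the odd part
            w *= omega
        return values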
How long does this take? We traded one problem of size N for two problems of size N/2, plus Θ(N) work for all the arithmetic in this loop. That is, T(N) = 2T(N/2) + Θ(N), which solves to Θ(N log N).
There is one other wrinkle that I want to add to the algorithm, and that is to say that this parameter here can be any primitive N-th root of unity. The real key is only that its N powers are all of the N-th roots of unity. We can add omega as a parameter to the algorithm.
This will come in handy later.
This network is called the butterfly network because these connections over here on the left look a little like a butterfly.
Also note that there is a unique left-to-right path between all nodes on the right and those on the left.
Another thing to note is that this sequence of evens and odds on these polynomials can be translated to binary. Thus, ooe becomes 110. Under this transformation, these numbers indicate which coefficient of the original polynomial gets returned. Even corresponds to grabbing the numbers with zero in the lowest-order bit; odd corresponds to grabbing the numbers with 1 in the lowest-order bit.
It will also be instructive to write out the power on omega in the value here on the right-hand side. It turns out that these numbers on the left act as instructions for any path to this node from any node on the right. These numbers on the right act as instructions for how to get here from any node on the left.
We have an O(N log N) way to evaluate the polynomials. We can multiply them in the value representation easily in O(N) time. But the interpolation remains a problem. Remember that this runtime involved solving a system of equations involving the Vandermonde matrix.
Each row corresponds to the powers of a value. The powers of 1 are all 1. The powers of ω are 1, ω, ω², ω³, …. The next value is ω², so its powers are 1, ω², ω⁴, ω⁶, …. In general,
M_N(ω)[k][j] = ω^{(k−1)(j−1)}.
This matrix has some very special properties. For our purposes, however, the key one can be summarized with the following
claim.
Let ω be a primitive N-th root of unity. Then
M_N(ω) · M_N(ω⁻¹) = N·I.
For the proof, consider element kj of this product. This will be the sum over ℓ of the products of the corresponding powers ω^{(k−1)ℓ} and ω^{−ℓ(j−1)}. That is,
Σ_{ℓ=0}^{N−1} ω^{(k−1)ℓ} · ω^{−ℓ(j−1)} = Σ_{ℓ=0}^{N−1} ω^{ℓ(k−j)}.
Now, if k = j, then every term is 1 and the sum is N. Otherwise, we recognize this as a geometric series and rewrite it as the ratio
(1 − ω^{(k−j)N}) / (1 − ω^{(k−j)}) = 0 when k ≠ j.
Raising any root of unity to the N-th power is just 1, so this expression is zero when k ≠ j. Thus, we have that entry kj is N if k = j, and 0 otherwise, proving the claim.
This claim is terribly important. Recall that evaluating a polynomial at the roots of unity corresponded to multiplying the coefficients by the matrix M_N(ω), and we used the FFT to do that. Now we see that we can multiply the values by the inverse of this matrix, also using the FFT, to recover the coefficients given the values! This is why it was key that the FFT work with any root of unity.
This realization about the Vandermonde matrix leads us to the inverse fast Fourier transform. We are given the values of a polynomial at the roots of unity, and we want to recover the coefficients of the polynomial. The algorithm is fantastically simple given what we've established so far: just run the regular FFT, only passing in the inverse of the root of unity that we used the first time. Then divide by N.
Recall that the values that we received as input were equal to the Vandermonde matrix times the coefficients. By multiplying the vector of these values by the conjugate of the Vandermonde matrix, via the FFT with omega inverse, we end up with N times the original coefficients. Hence, we just need to divide by N to recover them.
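Given the fft sketch from earlier, the inverse transform is just a few lines (assuming, as before, that N is a power of two):

    def inverse_fft(values, omega=None):
        """Recover coefficients from values at the roots of unity:
        run the FFT with omega inverse, then divide by N."""
        N = len(values)
        if omega is None:
            omega = cmath.exp(2j * cmath.pi / N)
        return [v / N for v in fft(values, 1 / omega)]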
Remember that polynomial multiplication was just a convenient way to think about the more general problem of convolution. Therefore, in general, convolving an n-long sequence with an m-long sequence need only take time O((n + m) log(n + m)), a remarkable and truly powerful result.
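Putting the pieces together, here is a hedged sketch of convolution via the FFT. The rounding at the end assumes integer coefficients; for real or complex coefficients you would keep the raw values.

    def fast_multiply(a, b):
        """Convolve sequences in O(n log n): pad to a power of two,
        evaluate, multiply pointwise, then interpolate back."""
        N = 1
        while N < len(a) + len(b) - 1:     # room for all coefficients of the product
            N *= 2
        fa = fft(a + [0] * (N - len(a)))
        fb = fft(b + [0] * (N - len(b)))
        fc = [x * y for x, y in zip(fa, fb)]          # multiply in the value domain
        c = inverse_fft(fc)
        return [round(v.real) for v in c[:len(a) + len(b) - 1]]  # assumes integer coeffs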
In this first lesson, we discuss the problem of finding a maximum flow through a network. A network here is essentially anything that can be modeled as a directed graph, and by flow, we mean the movement of some stuff through this medium, from a designated source to a destination. Typically we want to maximize the flow. That is to say, we want to route as many packets as possible from one computer across a network to another, or we want to ship as many items as possible from our factory to our stores. Even if we don't actually want to move as much as possible across our network, understanding the maximum flow can give us important information. For instance, in communication networks, it tells us something about how robust we are to link failures. And even more abstractly, maximum flow problems can help us figure out things that seem unrelated to networks, like which teams have been eliminated from playoff contention in sports.
Given the variety and importance of these applications, it should be no surprise that maximum flow problems have been studied extensively, and we have some sophisticated and elegant algorithms for solving them. Let's dive in.
The flow itself is a function from pairs of vertices to the nonnegative integers, say f : V × V → Z⁺. By this definition, the flow must be nonnegative. It also can't exceed the capacity for any pair of vertices: f(u, v) ≤ c(u, v). Finally, flow must be conserved at every internal vertex; the flow in must equal the flow out,
f_in(v) = Σ_{u∈V} f(u, v) = Σ_{u∈V} f(v, u) = f_out(v).
For the example vertex shown,
we have 4 + 2 = 6 units of flow coming in and 5 + 1 = 6 units of flow going out. Intuitively, this just means the internal nodes can't generate or absorb any of the stuff that's flowing. Those are the jobs of the source and the sink.
The overall value of the flow is defined as the flow going out of the source, or equivalently the flow going into the sink:
v(f) = f_out(s) = f_in(t).
One such limitation is the need for all of the capacities on the edges to be integers. We can extend all our arguments to rational capacities just by multiplying the capacities by the least common multiple of the denominators to create integer capacities. This just amounts to a change of units in our measurement of flow.
Another limitation we've imposed is that no antiparallel edges are allowed in the network. This forces us to choose a direction for the flow between every pair of vertices. In general, however, it might not be clear in which direction the flow should go before solving the maxflow problem. It's possible to cope with this problem with some slightly less elegant analysis; or, to convince yourself that the theorems still hold, you can add an artificial vertex between the two nodes and route the reverse flow through there.
Another limitation of our model is that we've limited ourselves to single-source, single-sink networks. At first it might seem that we couldn't handle a network like this one, which has two sources and three sinks. Actually, however, this situation is quite easy to deal with. We can just add an artificial source node and connect it to the sources with infinite-capacity edges, and similarly add an artificial sink.
Perhaps we realize that we can increase the flow by routing it through just the top and bottom paths and leaving out the middle. This is equivalent to adding a flow like the one shown in orange.
Notice that the flows through the middle cancel out. By adding this flow to the original, we get the desired result.
Alternatively, if we just wanted to re-route the flow through the top link, we could add a circular flow which would then re-route the flow.
In fact, all possible changes that we might make to our original flow can be characterized as another flow, only with different rules applied. Certainly, if we've used up some of the capacity on an edge, we can't use the full capacity in the flow we're going to add.
We capture the rules for the flow that we are allowed to add with the notion of a residual flow network. We start by defining the
residual capacity for all pairs of vertices.
If there is an edge between the pair, then the residual capacity is just however much capacity is left over after our original flow
uses up some of it. For reverse edges, the capacity is however much flow went over the edge in the opposite direction. We can
just unsend this much flow. Everywhere else the residual capacity is zero.
The edges of our residual network are just the places where the residual capacity is positive. Keeping the network sparse helps
in the analysis.
We can augment a flow in G by any flow in the residual network. Let's see how this works. We'll start with a flow f in G. Then, drawing the residual network, we'll let f′ be a flow there.
Of course, it obeys the residual capacity constraints. Then, we can add these two flows together using this special definition:
(f + f′)(u, v) = f(u, v) + f′(u, v) − f′(v, u) if (u, v) ∈ E, and 0 otherwise.
Note that only one of the values from f′ can be positive in the (u, v) ∈ E case.
The claim is that f + f′ is a flow in the original network G and that its value is just the sum of the values of the two individual flows.
This is pretty easy to verify with the equations, but it's not very illuminating, so we'll skip it here and focus on the intuition instead. Is this augmented flow a flow in the original network G? It fits within the capacity constraints, essentially by construction of the residual capacities. And it conserves flow because both f and f′ do. So yes, it is a valid flow.
The flow out from the source is a simple sum, which makes it a linear function, so yes it makes sense that the flow value of the
sum should be the sum of the flow values as well.
We begin by initializing the flow to zero. Then, while there is a path from the source s to the sink t in the residual graph, we calculate the minimum capacity along such a path and then augment with a flow that has this value along that path.
Once there are no more paths in the residual graph, we just return the current flow f that we've found.
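A minimal Python sketch of this loop follows. It finds augmenting paths with breadth-first search, which happens to be the Edmonds-Karp choice discussed later; the dictionary-of-capacities input, with no antiparallel edges, is an assumption for illustration.

    from collections import deque

    def ford_fulkerson(n, cap, s, t):
        """cap: dict (u, v) -> integer capacity, no antiparallel edges; vertices 0..n-1."""
        resid = dict(cap)                      # residual capacities
        for (u, v) in list(cap):
            resid.setdefault((v, u), 0)        # reverse edges let us "unsend" flow
        adj = {u: set() for u in range(n)}
        for (u, v) in resid:
            adj[u].add(v)
        value = 0
        while True:
            parent = {s: None}                 # breadth-first search for an s-t path
            q = deque([s])
            while q and t not in parent:
                u = q.popleft()
                for v in adj[u]:
                    if resid[(u, v)] > 0 and v not in parent:
                        parent[v] = u
                        q.append(v)
            if t not in parent:
                return value, resid            # for original edges, f(u,v) = cap - resid
            v, bottleneck = t, float('inf')    # minimum residual capacity along the path
            while parent[v] is not None:
                bottleneck = min(bottleneck, resid[(parent[v], v)])
                v = parent[v]
            v = t
            while parent[v] is not None:       # augment along the path
                u = parent[v]
                resid[(u, v)] -= bottleneck
                resid[(v, u)] += bottleneck
                v = u
            value += bottleneck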
Lets see how this works on a toy example. [See Video]
Does the algorithm terminate? Remember that the capacities are all integral, so each augmenting flow f′ has to have value at least 1; otherwise, the path wouldn't be in the residual graph. Therefore, we can't have more iterations than the maximum value for a flow. So yes, it terminates.
How much time does it spend per iteration? Finding a path can be done with breadth-first search or depth-first search in time proportional to the number of edges. Constructing the residual graph also takes this amount of time, as it has at most twice the number of edges of the original graph. And, of course, updating the flow requires a constant amount of arithmetic per edge. All in all, then, we have just O(|E|) time for each iteration, that is, time proportional to the number of edges.
This is a good start for the analysis, but it leaves us with some unanswered questions. Most important, perhaps, is whether the returned flow is a maximum flow. Sure, it's maximal in the sense that we can't augment it any further, but how do we know that with a different set of augmenting paths, or perhaps with an entirely different strategy altogether, we wouldn't have ended up with a greater flow value?
Also, this bound on the number of iterations is potentially exponential in the size of the input, leaving us with an exponential algorithm. Perhaps there is some way to improve the analysis or revise the algorithm to get a better runtime.
These two questions will occupy the remainder of the lesson. We'll start by showing that Ford-Fulkerson does indeed produce a maximum flow, and then we'll see about improving the running time.
It's possible that there will be smaller cuts as well, which will shrink the range of possible flow values for us.
Even better, however, we'll see that the flow produced by Ford-Fulkerson has the same value as a cut, and since this cut serves as an upper bound on all possible flows, this flow must be a maximum. In other words, the two arrows in the diagram will meet at the value of the flow produced by Ford-Fulkerson. That's where this argument will end up.
An s-t cut is a partition (A, B) of the vertices such that the source s is in the set A and the sink t is in the set B.
For example, in this network here, the green nodes might be in A and the orange ones in B .
The vertices within one side of the partition dont have to be connected in the definition.
Next, we observe that if f is an s-t flow and (A, B) is an s-t cut, then the value of the flow is the flow out of A minus the flow into A, or equivalently, the flow into B minus the flow out of B:
v(f) = f_out(A) − f_in(A) = f_in(B) − f_out(B).
For this first cut shown above, we have 2 + 6 entering and 2 leaving for a total of 6. For the second cut shown, we have
1 + 1 + 5 + 1 units exiting A and 2 units entering for a total of 6, which is indeed the flow.
As you might imagine, the proof comes from the conservation of flow. We start with the definition of the flow and then add a zero in the form of the conservation equation for each node in A.
For every edge where both vertices are in A, the terms cancel, leaving us with just the outgoing edges from A to B and the incoming edges from B to A. These sums are then just the flows out of and into A, as stated by the theorem. Check this for yourself with the equations.
The capacity of a cut is defined as
c(A, B) = Σ_{u∈A, v∈B} c(u, v).
Recall that
v(f) = f_out(A) − f_in(A).
We can just drop the second term, leaving us with just f_out(A):
v(f) ≤ f_out(A) = Σ_{u∈A, v∈B} f(u, v).
This is bounded by the capacities on the edges crossing the cut, and this sum is then the capacity of the cut:
v(f) ≤ Σ_{u∈A, v∈B} c(u, v) = c(A, B).
Note from this proof that the inequality will be tight when there is no flow re-circulating back into A and all the crossing edges are saturated, i.e. the flow is equal to the capacity. Keep this in mind.
The following are equivalent:
1. f is a maximum flow in G.
2. The residual network G_f has no augmenting paths.
3. There exists an s-t cut (A, B) such that v(f) = c(A, B).
This is the realization of the strategy outlined earlier, where we introduce the notion of the cut, show that it serves as an upper bound on the flow, and then show that Ford-Fulkerson produces a flow with the same value as a cut.
Let's see the proof of the Max-Flow Min-Cut theorem. We start by showing that if f is a maximum flow in a network, then it has no augmenting paths in the residual network. Well, suppose not, and let f′ be an augmenting flow.
Then we can augment f by f′, and the flow of the sum will be the sum of the flows. The value of the augmenting flow is positive, so we have created a new, greater flow, contradicting the fact that f was a maximum flow.
We'll make the next step in the proof an exercise. If (u, v) goes from A to B, what does that imply about the flow? Similarly, if it goes from B to A, what does that imply about the flow? Answer both questions.
Note that there is no path from s to t in the residual graph. The vertices that can be reached from s are marked green and the others orange. It's easy to see that all the edges from the green to the orange vertices are saturated, and the edges from the orange to the green are empty, just as the theorem claims.
Recall that for any cut the value of a flow is the flow out of the partition with the source minus the flow into the partition with the
source.
As we've just argued, however, in this case there is no flow back into the source partition. Moreover, the flow out saturates all the edges, so it's just the sum of the capacities across the cut, which is by definition the cut capacity:
v(f) = Σ_{u∈A, v∈B} f(u, v) = Σ_{u∈A, v∈B} c(u, v) = c(A, B).
If you followed all of that, congratulations! The max-flow/min-cut theorem is one of the classic theorems in the study of algorithms and a wonderful illustration of duality, which we'll discuss in a later lesson. For now, however, we are not quite ready to leave maximum flow yet.
Note that when Δ is 1, this is equivalent to the traditional residual network. Then we can state the algorithm as follows. We initialize the flow to be zero and the parameter Δ to be this trivial upper bound on the flow along a single path.
Then, while Δ ≥ 1, we look for an augmenting path p in G_f(Δ) and use it to augment the flow. Once all such paths are exhausted, we cut Δ in half.
Some of the analysis we can do just by inspection. Letting C be the initial value for Δ, we see that we use O(log C) iterations of the outer loop.
The steps of the inner loop cost only O(|E|) time.
The big question then is how many iterations of this inner loop are there?
The proof will feel a lot like the max-flow min-cut proof. We let A be the set of vertices reachable from s in G_f(Δ), and we let B be the complement of this set in V. Edges from A to B in this graph must have residual capacity at most Δ − 1, and edges from B to A must carry flow at most Δ − 1.
The flow is then
v(f) = Σ_{(u,v)∈A×B} f(u, v) − Σ_{(u,v)∈B×A} f(u, v)
     ≥ Σ_{(u,v)∈A×B} (c(u, v) − (Δ − 1)) − Σ_{(u,v)∈B×A} (Δ − 1)
     ≥ c(A, B) − 2|E|(Δ − 1).
The base case where Δ = C is trivial: here there can be at most one iteration. For subsequent iterations, we let f be the flow after a phase of Δ-augmentations and let g be the flow before, which would be after a phase with parameter 2Δ or 2Δ + 1.
The flow f is at most the maximum flow, and this is at most the capacity of the s-t cut induced by the residual graph G_g(2Δ + 1) from the previous iteration. Our lemma then says that v(f) ≤ v(g) + 4|E|Δ.
Now let k be the number of iterations used to go between g and f. Each iteration increased the flow by at least Δ, so kΔ ≤ v(f) − v(g) ≤ 4|E|Δ, giving at most 4|E| iterations per phase.
The cost of an iteration is O(|E|) as always, as we just use breadth-first search to find the shortest s-t path. If we can bound the number of iterations more tightly, we'll have a better bound than for the naive Ford-Fulkerson. Indeed, we will be able to bound it to be at most |V||E|, showing that the Edmonds-Karp algorithm returns a maximum flow in O(|E|²|V|) time.
Three edges have been deleted because they go between vertices of the same level or back up a level.
We first observe that augmenting along a shortest path only creates paths that are longer. Say that we push flow along the topmost path as shown below.
This introduces back edges along the path. Note, however, that these back edges are useless for creating a path of the same length. In fact, because they go back up a level, any path that uses one of them must use two more edges than the augmenting path we just used.
Next, we observe that the shortest path distance must increase every |E| iterations. Every time that we use an augmenting path, we delete an edge from the level graph, the one that got saturated by the flow.
From our example, suppose that the middle edge was saturated in an augmentation along this path.
This edge won't come back into the level graph until the minimum path length has increased. As we've already argued, the reverse edges are useless until we are allowed to use longer paths.
Lastly, there are only |V| possible shortest path lengths, so that completes the theorem.
Notice that with the Edmonds-Karp time bound of O(|E|²|V|), we've eliminated the dependence of the running time on the capacities. This means that the algorithm is now strongly polynomial, and actually we can eliminate the requirement that the capacities be integers entirely.
As with all augmenting-flow-based strategies, we start with an initial flow of zero. Then we repeat the following: we build a level graph from the residual flow network, and let k be the length of a shortest path from s to t. Then, while there is a path from source to sink that has this length k, we use it to augment the flow and update the residual capacities.
We repeat this until there are no more s-t paths, and we return the current flow.
Turning to the analysis, we'll call one iteration of this outer loop a phase, and we'll be able to argue that each phase increases the length of the shortest s-t path in the residual network by at least 1. The principle here is the same as for Edmonds-Karp: augmenting by a shortest path flow doesn't create a shorter augmenting path. Hence, once we have exhausted all paths of a given length, the next shortest path must be at least one edge longer. Hence there are only O(|V|) phases.
Within a phase, the level graph is built by breadth-first search, so that costs only O(|E|).
The hardest part of the argument will be that this loop altogether takes only O(|V||E|) time. We'll see this argument in a second.
Altogether, we will show that Dinic's algorithm takes O(|E||V|²) time, which is an improvement upon Edmonds-Karp's O(|E|²|V|) time.
If it doesn't, then we delete the last vertex in the path from the level graph.
In this example, we would first find a path from s to t, and maybe the middle edge is the bottleneck. Its capacity gets set to zero and it gets deleted. Next, we would build a path again from s, and this time we would run into a dead end. So we delete this vertex and continue.
There are only V vertices in the graph, so we can't run into more than V dead ends, and every augmentation deletes the bottleneck edge, so we can't delete more than E edges. Overall, then, we won't try more than E paths.
The process of building these paths and augmenting flows is just proportional to the path length, however, making the overall time cost of a phase just O(VE).
Taking this altogether, we have |V| phases, each costing O(|V||E|) time, for a total of O(|V|²|E|) time, as desired.
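For completeness, here is a compact Python sketch of a phase-based implementation in the style just described. The paired-edge representation (edge i and edge i^1 are reverses of each other) is an implementation choice, not something fixed by the lesson.

    from collections import deque

    def dinic(n, edge_list, s, t):
        """Max flow via phases. edge_list: (u, v, capacity) triples, vertices 0..n-1."""
        graph = [[] for _ in range(n)]       # graph[u] = indices of edges leaving u
        to, cap = [], []
        for u, v, c in edge_list:
            graph[u].append(len(cap)); to.append(v); cap.append(c)
            graph[v].append(len(cap)); to.append(u); cap.append(0)

        def bfs():
            """Build the level graph with breadth-first search."""
            level = [-1] * n
            level[s] = 0
            q = deque([s])
            while q:
                u = q.popleft()
                for e in graph[u]:
                    if cap[e] > 0 and level[to[e]] < 0:
                        level[to[e]] = level[u] + 1
                        q.append(to[e])
            return level if level[t] >= 0 else None

        def dfs(u, pushed, level, it):
            """Advance along level-increasing edges; retreating skips dead ends."""
            if u == t:
                return pushed
            while it[u] < len(graph[u]):
                e = graph[u][it[u]]
                v = to[e]
                if cap[e] > 0 and level[v] == level[u] + 1:
                    got = dfs(v, min(pushed, cap[e]), level, it)
                    if got:
                        cap[e] -= got       # saturate along the path
                        cap[e ^ 1] += got   # and grow the reverse edge
                        return got
                it[u] += 1                  # dead end: never retry this edge this phase
            return 0

        flow = 0
        while True:                          # each outer iteration is one phase
            level = bfs()
            if level is None:
                return flow
            it = [0] * n
            pushed = dfs(s, float('inf'), level, it)
            while pushed:
                flow += pushed
                pushed = dfs(s, float('inf'), level, it)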
Beyond just seeing algorithms in this lesson, we examined the max-flow min-cut theorem. This was more than just a trick for proving the correctness of Ford-Fulkerson. It's part of a larger pattern of Duality that provides important insight in a variety of contexts. We'll explore this more fully in our lesson on Duality.
I can label the green vertices as L and the orange ones as R, and every edge is between a green and an orange vertex.
A few observations are in order. First, saying that a graph is bipartite is equivalent to saying that it is two-colorable, for those familiar with colorings.
Next, let's take this same graph and add this edge here to make it non-bipartite.
Note that I've introduced an odd-length cycle, and indeed saying that a graph is bipartite is equivalent to saying that it has no odd-length cycles.
For graphs that aren't connected, it's possible that there will be some ambiguity in the partition of the vertices, so sometimes the partition is included in the definition of the graph, as in G = (L, R, E).
Note that the graph doesn't have to be bipartite. Take, for example, this graph here.
These two edges marked in orange constitute a matching. By the way, we'll refer to an edge in a matching as a matched edge, so these two edges are matched, and we'll refer to a vertex in a matched edge as a matched vertex. Here, there are four matched vertices.
A maximum matching is a matching of maximum cardinality.
Note that a maximal matching is not necessarily a maximum matching. For instance, the one shown above is maximal because I can't add any more edges and have it still be a matching. On the other hand, it is not maximum because here is a matching that has greater cardinality.
This happens in bipartite graphs too. It is possible to have a maximal matching that is not maximum because there is a greater
matching.
Applications - (Udacity)
Now that we know what bipartite graphs and matchings are, let's consider where the problem of finding a maximum matching in a bipartite graph comes up in real-world applications. Actually, we'll make this a quiz where I give you some problem descriptions and you tell me which can be cast as maximum matching problems in bipartite graphs.
First consider the compatible roommate problem. Here some set of individuals give a list of habits and preferences and for each
pair we decide if they are compatible. Then we want to match roommates so as to get a maximum number of compatible
roommates.
Next, consider the problem of assigning taxis to customers so as to minimize total pick-up time.
Another application is assigning professors to classes that they are qualified to teach. Obviously, we hope to be able to offer all of the classes we want.
And lastly, consider matching organ donors to patients who are likely to be able to accept the transplant. Of course, we want to
be able to do as many transplants as possible.
Check all the applications that can be naturally cast as a maximum matching problem.
We build a flow network that has the same set of vertices, plus two more, which serve as the source and the sink. We then add edges from the source to one half of the partition and edges from the other half of the partition to the sink. Edges are given a direction from the source side to the sink side. All the capacities are set to 1. Setting the capacities of the new edges to 1 is important, to ensure that the flow through any vertex isn't more than 1.
Having constructed the graph, we then run Ford-Fulkerson on it, and we return the edges with positive flow as the matching. Actually, all edges will have flow 0 or 1, as we'll see.
The time analysis here is rather simple. Building the network is O(E), maybe O(V), depending on the representation used.
This is small, however, compared to the cost of running Ford-Fulkerson, which is O(EV). Note that V is a bound on the total value of the flow and hence a bound on the total number of iterations. In this particular case, this gives a better bound than the one given by the scaling algorithm or by Edmonds-Karp.
Lastly, of course, returning the matched edges is just a matter of saying, for each vertex on the left, which vertex on the right it sends flow to. That's just O(V) time.
Clearly, Ford-Fulkerson is the dominant part, and we end up with an algorithm that runs in time O(EV).
The conservation of flow then implies that each vertex is in at most one edge that we return. On the left-hand side, we only have one unit of capacity going in, so we can't have more than one unit of flow going out. On the right, we have only one unit of capacity going out, so we can't have more than one unit of flow going in. This means that the set of edges in the original graph that have flow 1 represents a matching.
Is it a maximum matching? Well, if there were a larger matching, it would be trivial to construct a larger flow just by sending flow along those edges, so yes, it must be a maximum matching.
In a network that arises from a bipartite matching problem, there are a few special phenomena that are worth noting. First,
observe that all intermediate flows found by Ford-Fulkerson correspond to matchings as well. If there is flow across an internal
edge, then it belongs in the matching.
Also, because flows along the edges are either 0 or 1, there are no antiparallel edges in the residual network. That is to say, either the original edge is there or the reverse is, never both. Moreover, the matched edges are the ones that have their direction reversed. Also, only unmatched vertices have an edge from the source or to the sink; matched vertices have these edges reversed.
The result of all of this is that any augmenting path must start at an unmatched vertex and then alternately follow an unmatched edge, then a matched edge, then an unmatched edge, and so forth, until it finally reaches an unmatched vertex on the other side of the partition.
This realization allows us to strip away much of the complexity of flow networks and define an augmenting path for a bipartite
matching in more natural terms.
We start by defining the more general concept of an alternating path.
Given a matching M, an alternating path is one where the edges are alternately in M and not in M. An augmenting path is an alternating path where the first and last edges are unmatched.
For example, the path shown in blue on the right is an augmenting path for the matching illustrated in purple on the left.
We use it to augment the matching by making the unmatched edges matched and the matched ones unmatched along this path. This always increases the size of the matching: before we flipped, there was one more unmatched edge than matched edge along the path, so when we reverse the matched and unmatched edges, we increase the size of the matching by one.
In fact, we can restate the Ford-Fulkerson method purely in these terms.
1. Initialize M = ∅.
2. While there is an augmenting path p for M,
3. set M = M ⊕ p.
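In code, this restatement is pleasantly short. A sketch of the augmenting-path method in Python, assuming the left and right vertices have distinct labels:

    def max_bipartite_matching(adj, L):
        """adj: dict mapping each left vertex to its neighbors in R.
        Repeatedly finds an augmenting path by depth-first search and flips it."""
        match = {}                      # matched partner, stored for both sides

        def augment(u, visited):
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    # v is unmatched (path ends) or its partner can be re-matched
                    if v not in match or augment(match[v], visited):
                        match[v] = u
                        match[u] = v
                        return True
            return False

        for u in L:
            if u not in match:
                augment(u, set())
        return {u: match[u] for u in L if u in match}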
We also say that S covers all the edges. Take this graph, for example. If we include the lower left vertex marked in orange, then
we cover three edges.
By choosing more vertices so as to cover more edges, we might end up with a cover like this one.
One pretty easy observation to make about a vertex cover is that its size serves as an upper bound on the size of a matching in the graph.
In any given graph, the size of a matching is at most the size of a vertex cover.
The proof is simple. Clearly, for every edge in the matching at least one vertex must be in the cover, and all of these vertices
must be distinct because no vertex is in two matched edges.
Now, we are ready for the matching equivalent of the maxflow/mincut theorem, the max-matching/min vertex cover theorem.
The proof is very similar. We begin by showing that if M is a maximum matching, then it admits no augmenting paths. Well, suppose not. Then there is some augmenting path, and if we augment M by this path, we get a larger matching, meaning that M was not a maximum as we had supposed.
Next, we argue that if M admits no augmenting paths, then there exists a vertex cover of the same size as M. This is the most interesting part of the proof. We'll let H be the set of vertices reachable via alternating paths from unmatched vertices in L (the left-hand side of the partition).
We can visualize this definition by starting with some unmatched vertices in L, then following their edges to some set in R, then including the vertices in L that these are matched with, and so on.
Note that because M doesn't admit any augmenting paths, all of these paths must terminate in some matched vertex in L.
Let's draw the rest of the graph here as well.
We have some matched vertices in L, the vertices in R that they are matched to, and some unmatched vertices in R. Note that H and its complement correspond to the minimum cut we used when discussing flows. To get a minimum vertex cover, we select the part of H that is in R and the part of L that is not in H.
We call this set S. This set S contains exactly one vertex of each edge in M, so |S| = |M|.
Next, we need to convince ourselves that S is really a vertex cover. The edges we need to worry about are those from L ∩ H to R ∖ H. Such an edge cannot be matched, by our definition, and any such unmatched edge would place its vertex in R into H. Therefore, there are no such edges, and S is a vertex cover.
Finally, we have to prove that the existence of a vertex cover that is the same size as a matching implies that the matching is a maximum. This follows immediately from our discussion of how a vertex cover is an upper bound on the size of a matching.
Note that the fact that the neighborhood of X is larger than X bodes well for the possibility of finding a match for all the vertices
on the left hand side. At least, there is a chance that we will be able to find a match for all of these vertices.
When this is not the case, as seen below, then it is hopeless.
Regardless of how we match the first two, there will be no remaining candidates for the third vertex.
We can make this intuition precise with the Frobenius-Hall Theorem which follows from the max-matching/min vertex cover
argument.
Given a bipartite graph G = (L, R, E), there is a matching of size |L| if and only if for every X ⊆ L, we have that |N(X)| ≥ |X|.
The forward direction is the simpler one. We let M be a matching of the same size as the left partition, let X be any subset of
this side of the partition, and we let Y be the vertices that X is matched to.
Because the edges of the matching don't share vertices, |Y| = |X|. Yet Y ⊆ N(X), implying that |X| = |Y| ≤ |N(X)|: the neighborhood of X is at least the size of X.
The other direction is a little more challenging. Suppose not. That is, there is a maximum matching M, and it has fewer than |L| edges. We let H be the set of vertices reachable via an alternating path from an unmatched vertex in L. This is the same picture used in the max-matching/min-vertex cover argument. There is at least one such unmatched vertex by our assumption here.
The neighborhood of the left side of H is just the right side of H by construction, but the left-hand side must be strictly larger, because the matched vertices on either side effectively cancel each other out, leaving the unmatched vertices in L as extras. This violates the condition |N(X)| ≥ |X|, a contradiction.
That is to say, all of the vertices are matched. To review some of the key concepts from the lesson so far, we'll do a short exercise. Assume that the left and right sides are of the same size. Which of the following implies that there is a perfect matching?
We first initialize the matching to the empty set. Then we repeat the following: first, build an alternating level graph rooted at the unmatched vertices in the left partition L, using a breadth-first search. Here the original graph is shown on the left and the associated level graph on the right.
Having built this level graph, we use it to augment the current matching with a maximal set of vertex-disjoint shortest augmenting paths.
We accomplish this by starting at the unmatched vertices in R and tracing our way back. Having found a path to an unmatched vertex in L, we delete the vertices along the path, as well as any orphaned vertices, from the level graph. (See the video for an example.)
Note that we only achieve a maximal set of vertex-disjoint paths here, not a maximum. Once we have applied all of these augmenting paths, we go back and rebuild the level graph, and we keep doing this until no more augmenting paths are found. At that point we have found a maximum matching and we can return M.
From the description, it is clear that each iteration of this loop, what we will call a phase from now on, takes only O(E) time. The first part is accomplished via breadth-first search, and the second also amounts to just a single traversal of the level graph.
The key insight is that only about √V phases are needed.
Our overall goal, then, is to prove the theorem stated in the figure below.
Suppose that a phase augments M with a maximal set of vertex-disjoint shortest augmenting paths, each of length k. Let Q be a shortest augmenting path after the phase. It's impossible for |Q| to be less than k, by a previous lemma, which showed that augmenting by a shortest augmenting path never decreases the minimum augmenting path length.
On the other hand, |Q| being equal to k would imply that Q is vertex-disjoint from all paths found in the phase. But then it would have been part of the phase, so this isn't possible either.
Thus |Q| > k, and |Q| is also odd because Q is augmenting, so |Q| ≥ k + 2, completing the lemma.
The key argument will be that after roughly √V phases, there will only be √V phases left. Let M* be the matching found by Hopcroft-Karp, and let M be the matching after the first √V phases. Because each phase increases the shortest augmenting path length by two, the length of the shortest augmenting path in M is at least 2√V + 1.
Hence, no augmenting path in the difference between M and M* can have shorter length. This implies that there are at most |V| divided by this length augmenting paths in the difference; we just run out of vertices. If there can be only so many augmenting paths in the difference, then M can't be too far away from M*: certainly, at most √V augmentations away.
Hence M will only be augmented √V more times, meaning that there can't be more than √V more phases. We have √V phases to make all of the augmenting paths long enough, and after that there can only be √V more possible augmentations. That completes the theorem.
In summary, then, the Hopcroft-Karp algorithm yields a maximum bipartite matching in O(E√V) time.
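A compact Python sketch of Hopcroft-Karp in this style follows: the breadth-first search builds the level structure from the unmatched left vertices, and the depth-first search finds vertex-disjoint shortest augmenting paths, deleting dead ends as it retreats.

    from collections import deque

    def hopcroft_karp(adj, L):
        """adj: dict from left vertices to lists of right vertices."""
        match_l = {u: None for u in L}
        match_r = {}

        def bfs():
            # level graph rooted at the unmatched vertices in L
            dist, q = {}, deque()
            for u in L:
                if match_l[u] is None:
                    dist[u] = 0
                    q.append(u)
            found = False
            while q:
                u = q.popleft()
                for v in adj[u]:
                    w = match_r.get(v)        # follow the matched edge back to L
                    if w is None:
                        found = True          # reached an unmatched vertex in R
                    elif w not in dist:
                        dist[w] = dist[u] + 1
                        q.append(w)
            return dist, found

        def dfs(u, dist):
            for v in adj[u]:
                w = match_r.get(v)
                if w is None or (w in dist and dist[w] == dist[u] + 1 and dfs(w, dist)):
                    match_l[u] = v            # flip the edges along the path
                    match_r[v] = u
                    return True
            dist.pop(u, None)                 # dead end: delete u from the level graph
            return False

        while True:                           # each iteration is one phase
            dist, found = bfs()
            if not found:
                return {u: v for u, v in match_l.items() if v is not None}
            for u in L:
                if match_l[u] is None and u in dist:
                    dfs(u, dist)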
As far as concepts go, the ideas of how to represent systems of equations as matrices, of linear independence of vectors, matrix rank, and the inverse of a matrix should all be familiar. If they aren't, then it would be a good idea to refresh your understanding before watching the rest of this lesson.
As always, it is recommended that you watch with pencil and paper handy so that you can pause the video and work out details
on your own as needed.
max   x₁
s.t.  x₁ + x₂ ≤ 14
      x₁ − 2x₂ ≤ 2
      x₁, x₂ ≥ 0
We express the fact that he only has 14 hours for these activities by saying that x₁ + x₂ is at most 14. We express the fact that he needs half as much relaxation as work, after the first two hours of work, with the second constraint x₁ − 2x₂ ≤ 2. Of course, he can't spend negative time on either of these activities, so we need to add those constraints as well. The overall goal is to maximize time worked, so we make x₁ our objective function, and we want to maximize it subject to these constraints.
Now, in high school, your teacher probably asked you to begin by graphing the inequalities. When we do this, we see that the constraints generate the following polytope.
Perhaps your high school teacher didn't use the word polytope, but that's what this region here is. Each constraint restricts our solution to half of the plane, called a half-space, and a polytope is the intersection of half-spaces.
After you graph this region, the solution can be picked out as one of the vertices. In this case, it's pretty easy to see that it's this one on the right, which is at the intersection of the two problem constraints. Maybe, if the formulas were a little more complicated and you weren't sure, you could have tested each one of the vertices and picked the one with the highest objective value.
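If you would rather let a solver do the vertex testing, here is a sketch using scipy (an external library, not part of the lesson); note that maximizing x₁ is expressed as minimizing −x₁:

    from scipy.optimize import linprog

    res = linprog(c=[-1, 0],            # minimize -x1, i.e. maximize x1
                  A_ub=[[1, 1],         # x1 + x2 <= 14
                        [1, -2]],       # x1 - 2*x2 <= 2
                  b_ub=[14, 2],
                  bounds=[(0, None), (0, None)])  # x1, x2 >= 0
    print(res.x)   # [10., 4.]: the vertex where both problem constraints are tight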
Why is the optimal solution at one of the vertices? Well, remember that in this problem, and all similar ones from high school, the objective function, the thing we're optimizing, is linear. The only thing that matters is how far we can move in a certain direction, in this case the x₁ direction, but it could be any direction in the plane.
If you like, you can think of there being a giant magnet infinitely far away pulling our point x in a certain direction.
In trying to get as close as possible to the magnet, this point must end up at one of the vertices.
If some point is interior, then we can clearly improve by moving in this direction. If a point is on an edge, then we can improve by moving along this edge. The only time we couldn't improve in this way would be if the edge were perpendicular to the direction we wanted to move in. But then both vertices on either side of the segment have the same value and therefore are also optimal solutions.
Thinking more abstractly, there isn't always an optimal solution, as there is in this particular case. If I eliminate one constraint as shown below, then the polytope is unbounded in the gradient direction of our objective.
In this case, we can keep moving our point x further and further, getting greater values for our objective. You give me an x, I've got a better one! Hence there is no optimal solution.
On the other hand, if I put back that constraint and add another one, we might find that they are contradictory.
There is no way to satisfy them. If there are no solutions, there cant be an optimal one.
So those are the three things that can happen: the constraints can create a bounded region and we find an optimum, the region
can be unbounded, in which case we might find an optimum or the problem might be unbounded, or the region can be empty.
(Note that inequality over matrices means that the inequality holds for each element.)
When we first encounter a linear programming optimization problem, it might not be in this form. In fact, the only requirements for an optimization problem to be a linear program are that both the objective function and the constraints, inequalities or equalities, be linear. If this is true, then we can always turn it into a canonical form like this one.
Here are the key transformations.
Things get a little more interesting when we go from one of the inequalities to an equality. Here we introduce a new variable,
called a slack or surplus variable depending on the inequality.
There is also the problem of free variables that are allowed to be negative. There are two ways to cope with one of these. If it is
involved in an equality constraint, then you can often simply eliminate it through substitution. Otherwise, you can replace it with
the dierence of two new non-negative variables.
To better understand the relationship between these two forms, I'm going to write the standard form in terms of the symmetric form.
To convert the m inequalities to equalities, we introduce m slack variables x_{n+1}, …, x_{n+m}. Of course, this means that we need to augment our matrix A as well so that these slack variables can do their job. And c also needs to be augmented so that we can multiply it with the new x without changing the value of the objective function.
Geometrically, we've switched our optimization from being over a polytope in n dimensions (note the inequalities in the symmetric form) to being over the intersection of a flat (note the equality constraints) with the cone defined by the positive coordinate axes (note the non-negativity constraints) in n + m dimensions.
We expect that an optimum for the symmetric problem will be at one of the vertices of the polytope, where n of the hyperplanes defined by the constraints intersect. That is to say, of these n + m constraints (m from A and n from the non-negativity of x), n must hold with equality, or be tight in the common parlance. Some might come from A, others from the non-negativity constraints, but there will always be n tight constraints.
Over in standard form, the notion of whether the constraints from A are tight or not is captured by the slack variables that we introduced. A slack variable is zero if and only if the corresponding constraint is tight. Thus, at least n of the variables will be zero when we are at a vertex of the original polytope. In fact, if I tell you which n variables are zero, and these correspond to a linearly independent set of constraints, then you can construct the rest of the variables from the equality constraints.
Now, so far I have kept on using the number of variables n and the number of constraints m from the symmetric form, even as
we talk about the standard form. In general, however, when discussing the standard form we redefine the new n to be the total
number of variables (the old n + m ).
One other thing to note about this equality form is that we require the matrix A to have rank m, where m is the number of constraints. That is to say, the rows should be linearly independent. If the rows aren't linearly independent, there are two possibilities. One is that the constraints are inconsistent, meaning that there is no solution. The other possibility is that the constraints are redundant, meaning that some of them can be eliminated. So from now on we'll assume that A has full rank.
Ax = b.
A basic solution is one generated as follows: we'll pick an increasing sequence of m column numbers such that the corresponding columns are linearly independent, and call the resulting matrix B. This is easiest to see when the chosen columns are the first m, and we'll use this convention for most of our treatment.
We define $x_B = B^{-1} b$ and then embed this in the longer vector x, putting in the values from $x_B$ for columns in our sequence and zero otherwise. Then x is a basic solution.
Really, all that we're trying to accomplish here is to let $x_B$ get multiplied with the columns of A corresponding to B and have zero multiplied with all the other columns. (Remember that post-multiplication corresponds to column operations.)
So that's a basic solution, and we call it basic because it came from our choice of this linearly independent set of columns, which forms a basis for the column space. From the basis, we get a basic solution.
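As a concrete illustration, here is a minimal sketch (my own code, not from the lecture) of constructing a basic solution with NumPy; the matrix and basis below are made-up examples:

import numpy as np

def basic_solution(A, b, basis):
    """Return the basic solution with x_B = B^{-1} b and zeros elsewhere."""
    B = A[:, basis]                # the m chosen columns form the basis matrix B
    x_B = np.linalg.solve(B, b)    # solve B x_B = b rather than inverting B
    x = np.zeros(A.shape[1])
    x[basis] = x_B                 # embed x_B into the full-length vector
    return x

A = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, 3.0, 0.0, 1.0]])
b = np.array([4.0, 6.0])
print(basic_solution(A, b, [2, 3]))   # -> [0. 0. 4. 6.]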
It is possible for more than one basis to yield the same basic solution if some of the entries of xB are zero. Such a solution is
called degenerate. This corresponds to a vertex being the intersection of more than n hyperplanes in the symmetric form.
So far, this vocabulary only addresses the equality constraints. Adding in the non-negativity constraints on the variables, we use
the word feasible. Thus, a feasible solution is a solution that has all non-negative entries, and a basic feasible solution is
one that comes from a basis as described above and has all non-negative entries.
We'll start by proving the first point of the theorem statement above.
Let x be a feasible solution, and we'll consider only its positive entries. Without loss of generality, let's assume that they are the first p. Then it must be the case that
$x_1 a_1 + \cdots + x_p a_p = b.$
That is, after all, part of what it means to be feasible.
Case 1: Suppose first that the columns $a_1, \ldots, a_p$ are linearly independent.
Then it's not possible that p should be greater than m. If p = m, then x is basic, and we're done. The quantity p could be less than m, but then we would just add columns as needed until we formed a basis. That covers the independent case.
Case 2: Suppose that $a_1, \ldots, a_p$ are linearly dependent.
That means that there are coefficients $y_1, \ldots, y_p$ such that
$y_1 a_1 + \cdots + y_p a_p = 0,$
with at least one of these coefficients being positive. We'll then choose
$\varepsilon = \min \{ x_i / y_i : y_i > 0 \},$
so that $x - \varepsilon y$ is still feasible but has at least one more zero entry. This reduces the number of positive entries, and by repeating the argument we eventually reach Case 1, which proves the first point.
For the second point, suppose additionally that x is optimal. Note that for $\varepsilon$ sufficiently close to zero (both positive and negative), $x - \varepsilon y$ is feasible. Thus $c^T y = 0$; otherwise, we could choose the sign of $\varepsilon$ so as to make
$c^T x < c^T (x - \varepsilon y).$
Since we assumed that x is an optimal solution, we conclude that $c^T y = 0$. Therefore, we can set $\varepsilon$ to the same choice as before that sent one of the coefficients $x_1, \ldots, x_p$ to zero, without changing the objective value. By repeating this argument, we eventually reach Case 1.
We've just seen how the fundamental theorem of linear programming tells us that we can always achieve an optimal value for the program with a basic solution. Moreover, basic solutions come from a choice of m linearly independent columns for the basis. Remember this key point going forward.
For the simplex algorithm, we want to consider the effects of swapping out one of the current basis columns for another one. To do this, we first want to identify a good candidate for moving into the basis, one that will improve the objective function. As the program stands, however, it is not immediately clear which columns, if any, are good candidates. Sure, for some $x_i$ the coefficient might be positive, but raising that value might force others to change because of the constraints, making the whole picture rather opaque. Therefore, it will be convenient to parameterize our ability to move around in the flat defined by the equality constraints solely in terms of $x_D$, the variables that we are thinking about moving into the basis.
To this end, we solve the equality constraint for $x_B$ so that we can substitute for it where desired. First we substitute it into the objective function, and through a little manipulation, we get this expression.
The constant term here doesn't matter, since we are only considering the effects of changing $x_D$. The quantity that is multiplied with $x_D$, however, is important enough that it deserves its own name. Let's call it $r_D$:
$r_D^T = c_D^T - c_B^T B^{-1} D.$
In our reframing of the problem, therefore, we want to maximize $r_D^T x_D$. How about the other constraints? Well, the first one goes away with the substitution. The requirement that $x_B$ remain non-negative, however, remains.
Substituting our equation for $x_B$, we get the linear program
$\max\; r_D^T x_D \quad \text{s.t.} \quad B^{-1} D\, x_D \le B^{-1} b, \quad x_D \ge 0.$
The answer is that any column corresponding to a positive entry of $r_D$ is a good candidate. We want the entry to be positive because the corresponding element of $x_D$ is also going to be positive as we increase it. Just picking the greatest entry of $r_D$ doesn't work because even the greatest entry might be negative.
This idea then becomes the basis for the simplex algorithm. Pick q such that $r_q > 0$ and let $x_D = x_q e_q$, a multiple of the unit vector along the q-th coordinate axis.
This choice simplifies the optimization even further, since $x_D$ is now just proportional to the q-th column of D. We'll define $u = B^{-1} D e_q$ and $v = B^{-1} b$. Now we have
$\max\; r_q x_q \quad \text{s.t.} \quad x_q u \le v, \quad x_q \ge 0,$
and we raise $x_q$ as far as these constraints allow (the smallest ratio $v_i / u_i$ over entries with $u_i > 0$), then substitute back into the objective function. Note that the choice of basis (partitioning the columns of A into B and D) changes as the algorithm proceeds: the entering column q joins B, and the column whose variable was driven to zero leaves.
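To make the bookkeeping concrete, here is a rough sketch of one pivot step in NumPy (my own code, using the conventions above; a real implementation would maintain factorizations rather than re-solving every time):

import numpy as np

def simplex_pivot(A, b, c, basis):
    """Attempt one pivot; return the new basis, or None if already optimal."""
    n = A.shape[1]
    nonbasis = [j for j in range(n) if j not in basis]
    B, D = A[:, basis], A[:, nonbasis]
    # reduced costs: r_D^T = c_D^T - c_B^T B^{-1} D
    y = np.linalg.solve(B.T, c[basis])
    r_D = c[nonbasis] - D.T @ y
    if np.all(r_D <= 1e-12):
        return None                          # no improving column: optimal
    q = nonbasis[int(np.argmax(r_D))]        # entering column with r_q > 0
    u = np.linalg.solve(B, A[:, q])          # u = B^{-1} A_q
    v = np.linalg.solve(B, b)                # v = B^{-1} b (current x_B)
    ratios = [v[i] / u[i] if u[i] > 1e-12 else np.inf for i in range(len(u))]
    leave = int(np.argmin(ratios))           # ratio test: first entry to hit 0
    if ratios[leave] == np.inf:
        raise ValueError("LP is unbounded")
    new_basis = list(basis)
    new_basis[leave] = q                     # swap the leaving column for q
    return new_basis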
To start the simplex algorithm, we also need an initial basic feasible solution, and finding one is itself a linear programming problem. We introduce artificial variables y that represent the slack between Ax and b, require these variables to be non-negative, and then try to minimize their sum.
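Written out (a sketch, assuming we have first arranged $b \ge 0$ by negating rows as needed), the auxiliary program is
$\min\; \mathbf{1}^T y \quad \text{s.t.} \quad Ax + y = b, \quad x \ge 0,\; y \ge 0.$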
For this auxiliary program, it is easy to find a basic feasible solution: just set x = 0 and y = b. Therefore, we can start the
simplex algorithm. If we find that the optimum value is zero, then we can start our original program with the values in x. On the
other hand, if the optimum is greater than zero, that means that the original problem was infeasible. This is sometimes called the
Two Phase approach for solving LPs.
The simplex method as described here was first published by George Dantzig in 1947. Fourier apparently had a similar idea in the early 19th century, and Leonid Kantorovich had already used the method to help the Soviet army in World War II. It was Dantzig's publication, however, that led to the widespread application of the method in industry, as the lessons of operations research learned from the war began to be applied to the wider economy and fuel the post-war economic boom. It remains a popular algorithm today.
As practical as the algorithm was, theoretical guarantees on its performance remained poor, and in fact, in 1972 Klee and Minty showed that its worst-case complexity is exponential. It wasn't until 1979 that Khachiyan published the ellipsoid algorithm and showed that linear programs can be solved in polynomial time. His results were improved upon in 1984 by Karmarkar, whose method turned out to be practical enough to be competitive with the simplex method for real-world problems. Both of these algorithms take shortcuts through the middle of the polyhedron instead of always going from vertex to vertex.
In the next lecture, we'll talk about the duality paradigm, which arises out of linear programming and has been the source of many insights and the inspiration for new algorithms. Even with a whole other lesson, however, we are only able to scratch the surface of the huge body of knowledge surrounding this fundamental problem, which has shown itself to be of deep importance in both theory and practice.
Duality - (Udacity)
Introduction - (Udacity, Youtube)
Every linear program, it turns out, has a dual program which mirrors the behavior of the original. In this lesson, we will examine this phenomenon to give us a chance to apply some of the knowledge we gained about linear programs, as well as to deepen our understanding of some other problems that we've already studied. See if you can guess which problems they are as the lesson goes along.
Bounding an LP - (Udacity)
I want to start off our discussion with a little exercise where we try to find an upper bound on the value of a linear program. We'll start with this linear program here, and we're going to take a linear combination of its inequality constraints to obtain a bound on the objective function.
Multiplying the first inequality by $y_1$ and the second by $y_2$, and adding them together, we obtain a new inequality. Note that it is important that the y's be non-negative to avoid reversing the inequality.
If we choose $y_1$ and $y_2$ such that $6 \le 2y_1 + y_2$ and $2 \le y_1 + 2y_2$, then the objective function can be at most the left-hand side of our new inequality, which can be at most the right.
As we saw in the exercise, the dual program can be thought of as the problem of minimizing an upper bound on the primal.
Note that for all feasible x and y, we have $c^T x \le y^T A x$, using the constraint from the dual and the nonnegativity of x, and this is at most $b^T y$, using the constraint from the primal and the nonnegativity of y.
In fact, we just proved the Weak Duality Lemma, which states that if x is feasible for the primal problem and y is feasible for the dual problem, then $c^T x \le b^T y$.
Another thing to note here is that if your primal problem isn't in this exact form, you can always convert it, then look at the corresponding dual and simplify. Often, however, it is easier just to remember that the dual is the problem of bounding the primal as tightly as possible. For instance, if we change the inequality in the primal to an equality, then we can proceed by the same argument, only the first inequality becomes an equality, and I don't have to rely on y being non-negative. Everything else is the same.
Well, the answer is "Yes, they always do." More precisely, we state this as follows in the Duality Theorem.
If either the primal problem or the dual has a feasible optimal solution, then so does the other, and the optimal objective
values are equal. If either problem has an unbounded objective value, then the other is infeasible.
We'll start the proof by showing the second part. Suppose the primal is unbounded and y is feasible for the dual. (We're going to show that both of these can't be true.) By weak duality, $c^T x \le b^T y$ for all feasible x. Since the primal is unbounded, however, I can find an x that gives me a value as high as I want. Whatever the value of $b^T y$ is, I can find a feasible x such that $c^T x$ is larger, which creates a contradiction. The case where the dual is unbounded is analogous.
Now, we return to the first part: if either the primal problem or the dual has a feasible optimal solution, then so does the other, and the optimal objective values are equal. Let's start with the primal having a finite optimal solution. From this it follows that there is a finite basic optimal solution by the Fundamental Theorem of LP. Let's let the basis be the first m columns of the matrix A as usual and divide x and c up accordingly. (As usual, B stands for basic here.)
Recall then from the simplex algorithm that the vector $r_D$, which represented the effects of moving along one of the directions in $x_D$, had to be nonpositive, i.e.
$0 \ge r_D^T = c_D^T - c_B^T B^{-1} D.$
Otherwise, this basic solution wasn't optimal. Now, we're going to actually construct a solution for the dual. Letting
$y^T = c_B^T B^{-1},$
we have that $y^T D \ge c_D^T$ from the nonpositivity of $r_D$. Therefore,
$y^T b = c_B^T B^{-1} b = c_B^T x_B = c^T x,$
where x is the basic optimal solution. By weak duality, this is the best the dual can do, so y is also optimal.
It's natural to ask, are these phenomena all related? Well, yes they are, and probably the easiest way to see that is to realize that all of these problems can be characterized as linear programs and their duals.
Let's take a look at the duality of maximum matching in bipartite graphs first.
We'll let the variable $x_{ij}$ indicate whether the edge (i, j) should be included in the matching. Then, as a linear program, the problem becomes to maximize the number of matched edges subject to the constraints that no vertex in L can be matched more than once and no vertex in R can be matched more than once. Of course, we can't have negatively matched edges.
To build the dual program, we let $y_i$ and $y_j$ be the variables corresponding to these constraints, and we want to minimize their sum because the constraint vector here is just all ones.
For the constraints, observe that the coefficients in the objective function are 1 and that any $x_{ij}$ appears once in the constraint for i and once in the constraint for j.
Hence $y_i + y_j \ge 1$. And of course $y_i$ and $y_j$ can't be negative.
The interpretation here is straightforward: vertex i is in the cover if and only if $y_i = 1$, and similarly vertex j is in the cover if and only if $y_j = 1$.
Every edge must have at least one vertex in the cover, and we are trying to minimize the size of the cover.
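To see the pair concretely, here is a small sketch (my own example, assuming SciPy is available) that solves both LPs on a toy bipartite graph; the two optimal values coincide, as the Duality Theorem promises:

import numpy as np
from scipy.optimize import linprog

edges = [(0, 0), (0, 1), (1, 1)]          # (left vertex i, right vertex j)
nL = nR = 2

# Primal: maximize sum(x_e) s.t. each vertex is matched at most once.
A = np.zeros((nL + nR, len(edges)))
for e, (i, j) in enumerate(edges):
    A[i, e] = 1                            # constraint row for left vertex i
    A[nL + j, e] = 1                       # constraint row for right vertex j
primal = linprog(c=-np.ones(len(edges)), A_ub=A, b_ub=np.ones(nL + nR),
                 bounds=(0, None), method="highs")

# Dual: minimize sum(y_v) s.t. y_i + y_j >= 1 for every edge (i, j).
dual = linprog(c=np.ones(nL + nR), A_ub=-A.T, b_ub=-np.ones(len(edges)),
               bounds=(0, None), method="highs")

print(-primal.fun, dual.fun)               # both 2: matching size = cover size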
So we have just seen how maximum bipartite matching can be expressed as a linear program, and its dual also turned out to have a natural interpretation as the vertex cover problem. This is really neat. Every decision problem in P can ultimately be converted to a linear program, just because linear programming is P-complete, but not every conversion will result in variables and a dual program that have such intuitive interpretations. When this happens, it often gives us a way to gain deeper insight into a problem and its structure.
As you might have guessed, this happens also for the max-flow/min-cut problem, and we'll explore that next.
To express the dual, we'll use $y_u$ for the conservation constraint at vertex u and $y_{uv}$ for the capacity constraint at edge (u, v). Two subscripts mean a capacity constraint; one subscript means a conservation constraint.
The dual problem is to minimize the sum over all edges of $c_{uv} y_{uv}$. Note that the $y_u$'s have no role in the objective function because their coefficients are zero.
The constraints for the dual involve several cases. We'll consider first those arising from the objective function coefficients being one for edges out of the source. These flows appear once in the capacity constraint and once in the conservation equation for the receiving vertex.
The case for edges going into the sink is analogous: the flow is present in the capacity constraint and in the conservation-of-flow equation for the sending vertex. These constraints must be at least zero because the objective function coefficient is zero.
For all other edges, the flow appears in the capacity constraint and in BOTH conservation-of-flow equations. Again, the coefficient in the objective function is zero, so that becomes the constraint. And these dual variables have to be nonnegative.
The interpretation of these dual variables can be a little tricky, so I'm going to rearrange the constraints to isolate the capacity variables on the left-hand side.
This makes it a little easier to see what is going on. Actually, I think this would make a good exercise.
Interpretation of y - (Udacity)
Suppose that y is a basic optimal feasible solution for the given LP. Which statements are part of an interpretation of y as an s-t
cut, say (A, B)?
As input, we are given a graph and we want to find the smallest set of vertices such that every edge has at least one end in this
set. Recall that this problem is NP-complete. We reduced maximum independent set to it earlier in the course.
The approximation algorithm goes like this.
We start with an empty set, and then while there is still an edge that we haven't covered yet, we choose one of these edges arbitrarily and add both its vertices to the set. Then we remove all the edges incident on u and all those incident on v, since those edges are now covered. Next, we pick another edge and remove all edges incident on it. And so on, and so forth, until there aren't any edges left. At the end of this process, the set C that we've picked must be a cover. (See the video for an animation.)
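Here is a minimal sketch of that procedure in Python (my own code; the edge list below is an arbitrary example):

def approx_vertex_cover(edges):
    """Return a vertex cover at most twice the minimum size."""
    cover = set()
    remaining = list(edges)
    while remaining:
        u, v = remaining[0]            # pick an uncovered edge arbitrarily
        cover.update([u, v])           # add BOTH endpoints to the cover
        # keep only the edges not yet covered by the set so far
        remaining = [(a, b) for (a, b) in remaining
                     if a not in cover and b not in cover]
    return cover

print(approx_vertex_cover([(1, 2), (2, 3), (3, 4)]))  # e.g. {1, 2, 3, 4}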
Looking back at the original graph, it's not too hard to see that a cover obtained in this way need not be a minimum one. Here is a cover obtained with the algorithm (orange) and an optimal cover (green).
In this case, the ApproxVC algorithm returned a cover twice as large as the optimal cover. Fortunately, this is as bad as it gets.
Given a graph G that has minimum vertex cover $C^*$, the algorithm ApproxVC returns a vertex cover C such that $|C| / |C^*| \le 2$.
To prove this, it is useful to consider the set of edges chosen by the algorithm at the start of each iteration. We'll call this set M.
We use the letter M here because this set is a maximal matching. It pairs off vertices in such a way that no vertex is part of more than one pair. In other words, this set of edges must be vertex disjoint. That means that in order to cover just this set of edges, any vertex cover must include at least one vertex from each. Therefore, $|M| \le |C^*|$.
Since $C^*$ is a minimum cover, the set C chosen by the algorithm can only be larger. It is of size 2|M|. Altogether then, $|C| = 2|M| \le 2|C^*|$.
C and C* - (Udacity)
Given this theorem, let's explore the relationship between the size of the optimum cover and the one returned by our ApproxVC algorithm with a quick question. Suppose that G is a graph with a minimum vertex cover $C^*$ and that our ApproxVC algorithm returned a set of vertices C. Fill in the blanks below so as to make these statements as strong as possible.
We have the size of the set returned by our algorithm, |C|, and the size of the optimal one, $|C^*|$. We would like to be able to find some GUARANTEE about the relationship between the two. The trouble, of course, is that we don't know the optimal value. Actually, finding the optimal value is NP-complete, and that's why we are searching for an approximation algorithm anyway.
We resolve this dilemma by finding a lower bound on the size of the optimal cover in the maximal matching that the algorithm finds. Then we find the UPPER bound on our approximation in terms of this lower bound. Note that the approximation does not tell us enough about the optimum value to allow us to solve the decision version of the problem: does the graph have a vertex cover of a particular size?
Our approximation might be twice the optimum value, or the optimum might be nearly as large as our approximation. Since we can't tell which situation we are in, we can't tell where in this range the minimum vertex cover falls.
The first thing we need is a set of problem instances. For the example of minimum vertex cover this is just the set of undirected
graphs. For each instance, there is a set of feasible solutions. For min vertex cover, this is the set of covers for the graph. Next,
we need an objective function, the thing we are trying to optimize. For min vertex cover this is the size of the cover. And we
need to say whether we are minimizing or maximizing the objective. For min vertex cover, we are minimizing it of course.
Relating this back to our treatment of complexity, we say that an NP-optimization problem is one where these first three criteria are computable in polynomial time. That is to say, there is a polynomial algorithm that says whether the input instance is valid, one that can check whether a solution is feasible for the given instance, and one that can evaluate the objective function.
Now, every optimization problem has a decision version of the form: is the optimum value at most some value for a minimization, or at least some value for a maximization? For minimum vertex cover, we ask: is there a cover of size less than some threshold? With this in mind, we can then say an optimization problem is NP-hard if its decision version is. A problem is NP-hard, by the way, if an NP-complete problem can be reduced to it. In our example, min vertex cover is NP-hard because the decision version is. Remember that we reduced from the maximum independent set problem.
For every $\varepsilon > 0$, there is an $O(n^2 \log t / \varepsilon)$ time, factor $(1 + \varepsilon)$ approximation algorithm.
The smaller the epsilon, the better the approximation, but the worse the running time.
This is a remarkable result. It may be intuitive that one should be able to trade off spending more time for a better approximation guarantee, but it isn't always the case that we get to do so arbitrarily as in this theorem. Because this isn't a particular algorithm but rather a kind of recipe for producing an algorithm with the right properties, we call it a polynomial time approximation scheme, or PTAS for short. For every $\varepsilon$ you choose, there is an algorithm that can approximate that well.
This approximation scheme is extra special because the running time is polynomial in $1/\varepsilon$ as well as polynomial in the size of the input. Therefore, we say that this is a fully polynomial time approximation scheme. The alternative would be for the epsilon to appear in one of the exponents, perhaps. Then it would just be a polynomial time approximation scheme.
We are given a graph G. Usually, all the edges are present, and with each of them is associated some cost or distance. We'll assume that all the edges are present, so we won't draw them in our examples like this one here. The goal is to find the minimum cost Hamiltonian cycle. That is to say, we want to visit each of the vertices without ever visiting the same one twice.
This problem is NP-complete in general, and even a constant-factor approximation is impossible unless P=NP, as we will prove next for the Traveling Salesman problem.
For the proof, we reduce from the Hamiltonian cycle problem, where we are given a graph, not necessarily complete this time, and we want to know if there is a cycle that visits every vertex exactly once. Here then is how we set up the Traveling Salesman problem. We assign a cost of 1 to every edge in the original graph G and a cost of $\alpha|V| + 1$ to every edge not in the original graph.
Clearly, then, if G has a Hamiltonian cycle, the optimum for the Traveling Salesman problem has a cost of |V|, with a cost of 1 for every edge that it follows. A factor-$\alpha$ approximation would then find a Hamiltonian cycle with cost at most $\alpha|V|$. Letting $H^*$ be an optimal Hamiltonian cycle for the TSP instance and letting H be the cycle returned by the $\alpha$-approximation, we have that
$c(H) \le \alpha |V|.$
On the other hand, if the original graph G has no Hamiltonian cycle, then the cost of the cycle returned by the approximation algorithm must be at least as large as the optimum, which must follow at least one edge not in the original graph. Hence
$c(H) \ge \alpha|V| + 1 > \alpha|V|,$
so the approximation algorithm would let us distinguish the two cases and decide Hamiltonian cycle in polynomial time.
Here is the approximation algorithm. We start by building a minimum spanning tree. The usual approach here is to use one of the greedy algorithms, either Kruskal's or Prim's, that are typically taught in an undergraduate class. In Kruskal's algorithm, the idea is simply to take the cheapest edge between two unconnected vertices and add it to the graph until a tree is formed. (See the video for an animation.)
Next, we run a depth-first search on the tree, keeping track of the order in which the vertices are discovered. For this example, let's label the vertices with letters of the alphabet and start from C. Then the discovery order would go something like this.
Note that this cycle follows along the tree at first, from c to b to e to h to i, but instead of backtracking to h, it goes directly to j. Then it goes directly to d, and so on.
This cycle seems to always be taking shortcuts compared to the traversal the depth-first search performed. For the general Traveling Salesman problem, we can't be sure these are in fact shortcuts, because we can't assume the triangle inequality. Where we do have the triangle inequality, however, these will be shortcuts, and as we'll see, that will be the key to the analysis.
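Putting the two phases together, here is a rough sketch in Python (my own code, assuming dist is a symmetric cost matrix obeying the triangle inequality):

import math

def tsp_two_approx(dist):
    n = len(dist)
    # Prim's algorithm: grow the MST one cheapest vertex at a time.
    in_tree, parent = [False] * n, [None] * n
    best = [math.inf] * n
    best[0] = 0
    tree = {v: [] for v in range(n)}
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] is not None:
            tree[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v], parent[v] = dist[u][v], u
    # Depth-first search, recording vertices in discovery (preorder) order.
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(reversed(tree[u]))
    # Visit the vertices in discovery order, then return to the start.
    tour = order + [0]
    cost = sum(dist[tour[i]][tour[i + 1]] for i in range(n))
    return tour, cost

dist = [[0, 1, 2, 2],
        [1, 0, 1, 2],
        [2, 1, 0, 1],
        [2, 2, 1, 0]]
print(tsp_two_approx(dist))   # ([0, 1, 2, 3, 0], 5)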
The process for building the minimum spanning tree is $O(V^2)$ for dense graphs, and the depth-first search takes time proportional to the number of edges, which is the same as being proportional to the number of vertices for trees. That takes care of the efficiency of the algorithm.
Now for the factor-two part. Consider this example here, and let H be a minimum cost Hamiltonian cycle over this graph; this is what an exact algorithm might output. Well, the cost of the minimum spanning tree that the algorithm finds must be less than the total cost of the edges in this cycle. Otherwise, just removing an edge from the cycle would create a lower-cost spanning tree. (Remember that the costs must be non-negative.) Thus, for a minimum spanning tree T, we have
$\sum_{e \in T} c(e) \le \sum_{e \in H} c(e).$
The cost of a depth-first search traversal is twice the sum of the costs of the edges in the tree, i.e. $2 \sum_{e \in T} c(e)$. This traversal also starts and ends at the same vertex, so it's a cycle.
The trouble is that it's not Hamiltonian: some vertices might get visited twice. It's easy enough, however, to count only the first time that a vertex is visited. In fact, this is what ordering the vertices by their discovery time achieves.
By the triangle inequality, skipping over intermediate vertices can only make the path shorter, so the cost of this cycle is at most the cost of the depth-first traversal. Thus, the cycle we return costs at most
$2 \sum_{e \in T} c(e) \le 2 \sum_{e \in H} c(e).$
As we argued, the cost of a minimum spanning tree must be less than the cost of the optimum cycle, since we can just delete one edge from the cycle and get a spanning tree. A depth-first traversal of the spanning tree uses every edge twice and therefore costs twice the tree. Shortcutting all but the first visit to each vertex in this traversal gives a Hamiltonian cycle, which MUST have lower cost because of the triangle inequality.
Let the blue edges have cost 1 and the red ones have cost 2. Enter a minimum cost solution in the box below.
Then, preferring lowest-indexed vertices, a depth-first traversal would produce this cycle.
Notice that every edge followed in this cycle is a red one except the first and the last. Hence the cost is $2 \cdot 6 - 2 = 10$. The ratio is therefore 10/6. But there wasn't really anything special about the fact that we were using 6 vertices here. We can form an analogous graph for any n, letting the lighter edges be the union of a star and a cycle around the non-center vertices. All other edges can be heavy.
My question to you then is: how bad does the approximation get for 100 vertices? Give your answer in this box.
Being slightly more general, we can state the problem like this: given representations of polynomials A and B having degree d, decide whether they represent the same polynomial. Note that we are being totally agnostic about how A and B are represented. We are just assured that A and B are indeed polynomials and that we have a way of evaluating them. Well, here is a fantastically simple algorithm for deciding whether these two polynomials are equal.
1. Pick a random integer x in the range [1, 100d].
2. Evaluate A(x) and B(x).
3. Declare A and B equal if A(x) = B(x), and unequal otherwise.
Why does this work? Well, the so-called Fundamental Theorem of Algebra says that any non-zero polynomial of degree d can have at most d roots.
So if A and B are different, the bad case is that their difference has d roots and, what's worse, that all of them are integers in the range [1, 100d]. Even so, the chance that the algorithm picks one of these is still only 1/100. So if the polynomials are the same, we will always say so, but if they are different, there is only a 1-in-100 chance that we will say they are the same. This is pretty effective, and if it is found that A is not equal to B in some case, this algorithm is so simple that there can't be much dispute over which piece of code is incorrect.
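In code, the whole algorithm is only a few lines (a sketch of my own; A and B stand for whatever black-box evaluators we are given):

import random

def probably_equal(A, B, d):
    """One-sided Monte Carlo test: wrong with probability at most 1/100."""
    x = random.randint(1, 100 * d)   # pick a random integer in [1, 100d]
    return A(x) == B(x)              # equal polynomials always agree

A = lambda x: (x + 1) ** 2           # represents x^2 + 2x + 1
B = lambda x: x * x + 2 * x + 1      # the same polynomial, degree d = 2
print(probably_equal(A, B, 2))       # always True for equal polynomials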
A discrete probability space consists of a sample space Omega that is finite or countably infinite. This represents the set of
possible outcomes from whatever random process we are modeling. In the case of the previous algorithm, this is the value of x
that is chosen.
Second, a discrete probability space has a probability function with these properties. It must be defined from the set of subsets
of the sample space to the reals. Typically, we call a subset of the sample space an event E. For every event, the probability
must be at least 0 and at most 1. The probability of the whole sample space must be 1. And for any pairwise disjoint collection
of events the probability of the union of the events must be the sum of the probabilities.
To illustrate this idea with the polyequal example, let's define the events $F_i$ to be the sets consisting of the single element i, i.e. $F_i = \{i\}$. This corresponds to i being chosen as the value at which we test the polynomials. And we define the probability of the event $F_i$ as $\Pr(F_i) = 1/(100d)$. Now, these $F_i$ aren't the only possible events; they are just the single-element sets of the probability space. We need to define our function over all subsets. But actually, we have done so implicitly already because of property 2c above. Any subset S of the sample space is the union of the individual events $F_i$. These are disjoint, so the probability of the union is the sum of the probabilities, and so the result is the size of the set divided by the size of the sample space, as we would expect. That is to say,
$\Pr(S) = \Pr\left(\bigcup_{i \in S} F_i\right) = \sum_{i \in S} \Pr(F_i) = \frac{|S|}{|\Omega|}.$
Let's confirm that this function meets all of the requirements of the definition. Property 2c holds because the size of a disjoint union is the sum of the sizes of the individual sets. We see from the ratio $\Pr(S) = |S|/|\Omega|$ that the probability of the whole sample space is 1. And the probability of any event is between 0 and 1. So this example here is, in fact, a discrete probability space.
By the way, this example probability function is called uniform, because the probabilities of every single-element event (that is to say, the $F_i$ here) are the same. Not all probability functions are like that.
We start out by assuming that the two polynomials are equal. Then we try different values for x, and if we ever find a value for which the two polynomials are not equal, then we know that they aren't. Note that we could terminate as soon as a difference is found, but this version of the algorithm makes the analysis a little clearer.
For simplicity, we'll make this k equal to 2 so that we can visualize the sample space with a 2D grid like so.
The row corresponds to the value of x in the first iteration, the column to the value of x in the second iteration. Now, the size of the sample space is $(100d)^2$, and since there are $d^2$ pairs of roots for the difference between A and B, at most $d^2$ of these possibilities make the algorithm fail. We'll let F be the event that the algorithm fails on unequal polynomials, that is, that A(x) = B(x) for both chosen x. By symmetry, we can argue that all elements of the sample space should have equal probability, so the probability of the algorithm failing on two unequal polynomials is
$\Pr(F) \le \frac{d^2}{(100d)^2} = \left(\frac{1}{100}\right)^2.$
Similarly, we let $E_2$ be the event that the polynomials are equal at the x chosen in the second iteration. In the grid, this event corresponds to a certain subset of the columns. Again, these red columns take up at most 1/100th of the whole probability mass.
We are interested in the probability of both $E_1$ and $E_2$ happening, i.e. the intersection of these two events, represented as the black region in our grid.
What fraction of the probability mass does it take up? Notice that in order for a sample to fall into the black region, the first iteration must restrict us to the blue region. The probability of this happening is just the probability of $E_1$. Then, from within the blue region, we ask what fraction of the probability mass $E_2$ takes up. Well, that's just $\Pr(E_1 \cap E_2)/\Pr(E_1)$, and we want to multiply this quantity with $\Pr(E_1)$ to get the result. This sort of ratio is common enough that it gets its own name and notation. We notate it like so,
$\Pr(E_2 \mid E_1) = \frac{\Pr(E_1 \cap E_2)}{\Pr(E_1)},$
and read this as the probability of $E_2$ given $E_1$. This is called a conditional probability. The interpretation is that it gives the probability of $E_2$ happening, given that $E_1$ has already happened.
More specifically, for our polynomial verification, it is the probability that the second iteration will pick a value where the polynomials are equal given that the first one did. Well, of course, this is the same as the probability of $E_2$ happening regardless of what happened in the first iteration, so this is just the probability of $E_2$. This is a condition known as independence, and it corresponds to our intuitive notion of one event not depending on another.
Substituting in these values, we find that this approach gives the same result as the other: $(1/100)^2$.
We can visualize this quantity using a traditional Venn diagram. We draw the whole sample space as a large rectangle, and within it we draw the set F like so.
When we talk about the conditional probability of E given F, we are restricting ourselves to the set F within the sample space. Thus, only the portion of E that is also in F is important to us. To make this a proper probability, we have to renormalize by dividing by the probability of F. That way, the probability of E given F and the probability of not-E given F sum to 1:
$\Pr(E \mid F) + \Pr(\bar{E} \mid F) = \frac{\Pr(E \cap F)}{\Pr(F)} + \frac{\Pr(\bar{E} \cap F)}{\Pr(F)} = \frac{\Pr(F)}{\Pr(F)} = 1.$
An interesting situation is where the probability of an event E given F is the same as the probability when F isn't given:
$\Pr(E \mid F) = \Pr(E).$
This implies that E and F are independent; one doesn't depend on the other. Formally, we say:
Two events E and F are independent if $\Pr(E \cap F) = \Pr(E)\,\Pr(F)$.
Bullseyes - (Udacity)
Here is a quick question on independence. Suppose that there is a 0.1 probability that Sam will get a bullseye each time he throws a dart. What is the probability that he gets 5 bullseyes in a row?
Let's go back to our polynomial verification algorithm with repeated trials and review its behavior.
If the two input polynomials are equal, then the probability that the algorithm says so is 1. On the other hand, if the polynomials are different, there is a chance that the algorithm will get the answer wrong, but this happens only with probability $(1/100)^k$, where k is the number of trials that we did. We just need to extend the argument we made before for k = 2 to general k.
The fact that our algorithm might sometimes return an incorrect answer makes it what computer scientists call a Monte Carlo algorithm. And because it makes a mistake only when the polynomials differ, it is called a one-sided Monte Carlo algorithm.
This idea can be extended to arbitrary languages. Here, strings in the language represent equal polynomials, and this algorithm only makes mistakes on strings not in the language. Of course, it's possible for the situation to be reversed so that the algorithm makes mistakes on strings in the language. This is another kind of one-sided Monte Carlo algorithm. I should say that there are also two-sided Monte Carlo algorithms, where both errors are possible, but regardless of what input is given, the answer is more likely to be correct than not.
Suppose, however, that any possibility of error is intolerable. Can we still use randomization? Well, yes we can. Instead of picking a new point uniformly at random from the 100d possible choices, we can pick one from the choices that we haven't picked before. This is known as sampling without replacement, since we don't put the sample we took back in the pool before choosing again. There are only d possible roots, so by the time we've picked the (d + 1)-th point, we must have picked one of the non-roots.
This algorithm still uses randomization, but nevertheless it always gives a correct answer. If the polynomials are equal, the probability that the algorithm says so is 1. If they are unequal, the probability that it says they are equal is 0.
The fact that this algorithm never returns an incorrect answer makes it a Las Vegas algorithm. The randomization can affect the running time, but not the correctness. If the polynomials are equal, the algorithm definitely takes d + 1 iterations, but when they are unequal, it gets a little more complicated.
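Before the analysis, here is a sketch (my own code) of this sampling-without-replacement version:

import random

def definitely_equal(A, B, d):
    """Las Vegas test: always correct; only the running time is random."""
    for x in random.sample(range(1, 100 * d + 1), d + 1):
        if A(x) != B(x):
            return False             # a witness: the polynomials differ
    return True                      # d+1 distinct agreements: certainly equal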
Let $E_i$ be the event that A and B are equal at the i-th element of the order array chosen randomly by the algorithm above. Then we characterize the probability that the algorithm takes at least k steps as $\Pr\left(\bigcap_{i=0}^{k-1} E_i\right)$. Note, however, that these events are no longer independent. If A and B are equal at the first element of the list order, then that's one fewer root that could have been chosen to be the second element. So how do we go about calculating this probability?
Returning to the Las Vegas version of our polynomial identity verifier, we can write the probability that we don't detect a difference in k iterations as the product
$\Pr\left(\bigcap_{i=0}^{k-1} E_i\right) = \Pr(E_0)\,\Pr(E_1 \mid E_0) \cdots \Pr(E_{k-1} \mid E_0 \cap \cdots \cap E_{k-2}) = \prod_{i=0}^{k-1} \frac{d - i}{100d - i} \le \frac{1}{100^k}.$
With a little more work, we can get a tighter bound than this, but for our purposes this simple bound works. Note that even though the probabilities for the Las Vegas and Monte Carlo algorithms are the same, the meanings are different. In the Monte Carlo algorithm, our analysis captured the probability of the algorithm returning a correct answer. In the Las Vegas algorithm, our analysis says something about the running time of an algorithm that will always produce the correct answer.
A random variable is a function $X : \Omega \to \mathbb{R}$ from the sample space to the reals.
For example, let X be the sum of two die throws. Then the sample space is $\Omega = \{1, \ldots, 6\} \times \{1, \ldots, 6\}$, and the random variable X is the function that just adds the two numbers together,
$X((i, j)) = i + j.$
We use the notation X = a, where a is some constant, to indicate the event that X is equal to a. Thus, it is the set of elements of the sample space for which the function X is equal to a. In our example, the event X = 4 is the set {(1, 3), (2, 2), (3, 1)}.
As can be seen from the formula, the expectation is a weighted average of all the values that the variable could take on. For example, let the random variable X be the number of heads in 3 fair coin tosses. Then, according to this definition, the expectation would be:
0 · 1/8 for getting no heads,
1 · 3/8 for getting 1 head, as there are three possible tosses that could have come up heads,
2 · 3/8 for getting 2 heads, as there are three possible tosses that could have come up tails to give us two heads,
3 · 1/8 for getting 3 heads.
Adding these all up, we get 12/8 = 3/2. Now, if I asked you casually how many heads there will be in three coin tosses on average, you probably would have said 3/2 rather quickly and without doing all this calculation. Each toss should get you 1/2 a head, so with 3 you should get 3 halves, you might have reasoned.
In terms of our notation, we can express the argument like this. We let $X_i$ be 1 if the i-th fair coin toss is heads and 0 otherwise. Then we say that the average number of heads in three tosses is
$E[X_1 + X_2 + X_3] = E[X_1] + E[X_2] + E[X_3] = \frac{3}{2}.$
The key step in the proof is the first equality here, which says that the expectation of the sum is the sum of the expectations.
This is called the linearity of expectation, and as we will see, this turns out to be a very powerful idea.
In general,
For any two random variables X and Y,
$E[X + Y] = E[X] + E[Y],$
and for any constant a,
$E[aX] = aE[X].$
The expectation of the sum is the sum of the expectations, and we can just factor out constant factors from expectations.
Remember this theorem.
Use the linearity of expectation and write down the expected number of 3-cycles as a ratio here.
To keep things simple, we'll assume that the elements to be sorted are distinct. This is a recursive algorithm, with the base case being a list of 0 or 1 elements, where the list can simply be returned. For longer lists, we choose a pivot uniformly at random from the elements of the list and then split the list into two pieces: one with those elements less than the pivot and one with those elements larger than the pivot. We then recursively sort these shorter lists and join them back together once they are sorted.
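In Python, the algorithm is just a few lines (a sketch of my own, assuming distinct elements as above):

import random

def quicksort(items):
    if len(items) <= 1:
        return items                          # base case: already sorted
    pivot = random.choice(items)              # pivot chosen uniformly at random
    smaller = [x for x in items if x < pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)

print(quicksort([5, 3, 8, 1, 9, 2]))          # [1, 2, 3, 5, 8, 9]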
The efficiency of the algorithm depends greatly on the choices of the pivots. We can visualize a run of the algorithm by drawing out the recursion tree. I'll write out the list in sorted order so that we can better see what is going on, though the algorithm itself will likely have these elements in some unsorted order.
The ideal choice of pivot is always the middle value in the list. This splits the list into two equal-size sublists, one consisting of the larger elements, the other of the smaller elements. Then in the recursive calls, we split these lists into two pieces, until we get down to the base case.
Because the sizes get cut in half with each call, there are only log n levels to the tree. Every element gets compared with a pivot, so there are O(n) comparisons at each level, for a total of O(n log n) comparisons overall. That's if we are lucky and pick the middle element for the pivot every time.
How about if we are unlucky? Suppose we pick the largest element in every iteration.
Then the size of the list only decreases by one in every iteration, so there are n levels. The first level requires n - 1 comparisons, the second n - 2, and so forth, so that the total number of comparisons is an arithmetic series and therefore is $O(n^2)$. This is as bad as a naive algorithm like insertion sort. The natural question to ask, then, is how does quicksort behave on average? Is it like the best case where the pivot is chosen in the middle, the worst case that we have here, or somewhere in between?
Claim: Letting $X_{ij}$ indicate whether the i-th and j-th smallest elements, $a_i$ and $a_j$, are ever compared,
$E[X_{ij}] = \frac{2}{j - i + 1}.$
For the proof, observe that the expectation of a zero-one variable is just the probability that the variable is equal to one.
The element $a_i$ has to be separated from $a_j$ by some pivot in the algorithm, and they won't be separated twice. Therefore, fixing i and j, the events $E_{ijk}$ are disjoint, so
$\Pr(X_{ij} = 1) = \sum_{k : \Pr(E_{ijk}) > 0} \Pr(X_{ij} = 1 \mid E_{ijk})\,\Pr(E_{ijk}),$
and it's the conditional probability that will be easiest to reason about.
Given that $a_i$ is going to be separated from $a_j$, it must be that they haven't been separated yet. So the list must include $a_i$, $a_j$, every element in between, and possibly some more outside this range.
Given that the separation does occur, the pivot must be chosen in the range $[a_i, a_j]$. The element $a_i$ will only be compared to $a_j$, however, if one of the two is chosen as the pivot. Therefore, given that the separation is going to occur, the probability that it will actually require a comparison is only 2 divided by the number of possible choices for the pivot, $j - i + 1$.
Substituting this value gives us the answer:
$\Pr(X_{ij} = 1 \mid E_{ijk}) = \frac{2}{j - i + 1}, \quad \text{so} \quad E[X_{ij}] = \frac{2}{j - i + 1}.$
With the claim established, we argue by the linearity of expectation that the expected total number of comparisons is just the sum of the expectations for these $X_{ij}$'s:
$E\left[\sum_{i<j} X_{ij}\right] = \sum_{i<j} \frac{2}{j - i + 1} \le \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{2}{j} = O(n \log n).$
It's possible to do some tighter analysis, but for our purposes it is enough to use this loose bound, which becomes just the sum of n harmonic series. Summing 1/j is rather like integrating 1/x, so the inner sum becomes a log and we get O(n log n) in total.
Overall, then we can state our results as follows.
For any input of distinct elements, quicksort with pivots chosen uniformly at random performs O(n log n) comparisons in expectation.
The average case is on the same order as the best case. This is comforting, but by itself it is not necessarily a good guarantee of performance. It's conceivable that the distribution of running times could be very spread out, so that the running time might be a little better than the guarantee or potentially much worse.
It turns out that this is not the case. The running times are actually quite concentrated around the expectation, meaning that one is very unlikely to get a running time much larger than the average. This sort of argument is called a concentration bound, and if you ever take a full course on randomized algorithms, a good portion will be devoted to these types of arguments.
This set of edges S is called a minimum cut set. Note that this is a different problem from the minimum s-t cuts that we considered in the context of maximum flow. There are no two particular vertices that we are trying to separate here (any two will do), and all the edges have equal weight. We could use the minimum s-t cut algorithm to help solve this problem, but I think you will agree that this randomized algorithm is quite a bit simpler.
The algorithm operates by repeatedly contracting edges so as to join the two vertices together. Once there are only two vertices
left, they define the partition. (See video for animated example.)
Now, this particular choice of edges led to a minimum cut set, but not all such choices would have. How then should we pick an edge to contract?
It turns out that just picking a random one is a good idea. More often than not, this won't yield a correct answer, but as we'll see, it will be right often enough to be useful.
Karger's minimum cut algorithm outputs a min-cut set with probability at least $\frac{2}{n(n-1)}$.
Now at first, you might look at this result and ask, what good is that? The algorithm doesn't even promise to be right more often than not. The trick is that we can call the algorithm multiple times and just take the minimum over the results. If we do this $\frac{n(n-1)}{2} \ln(1/\delta)$ times, then there is a $1 - \delta$ chance that we will have found a minimum cut set.
The proof for this corollary is that each call to the algorithm is independent, so the probability that all of the calls fail is given by
$\left(1 - \frac{2}{n(n-1)}\right)^{\frac{n(n-1)}{2} \ln(1/\delta)}.$
In general $1 - x \le e^{-x}$, so applying that inequality, we have that the $n(n-1)/2$ factors cancel in the exponent and we are left with
$\exp(-\ln(1/\delta)) = \delta.$
Thus, the probability that all the iterations fail is at most $\delta$, so the chance that at least one of them succeeds is $1 - \delta$. This bound here is extremely useful in analysis and is one that you should always have handy in your mental toolbox.
So if the theorem is true, we can boost it up into an effective algorithm just by repeating it, but why is the theorem true?
Consider a minimum cut set C, and let $E_i$ be the event that the edge chosen in iteration i of the algorithm is not in C. Note that there could be other minimum cut sets as well. For the analysis, however, we'll just consider the probability of finding this particular one.
Returning the cut set C means not picking an edge of C in any iteration, so it's the intersection of all the events $E_i$, which we can turn into the product
$\Pr\left(\bigcap_{i=1}^{n-2} E_i\right) = \Pr\left(E_{n-2} \,\Big|\, \bigcap_{i=1}^{n-3} E_i\right) \cdots \Pr(E_2 \mid E_1)\,\Pr(E_1),$
as we've done before. We're just conditioning the probability of avoiding C in the i-th iteration on our having avoided it in all previous ones.
We now make the claim:
Claim: $\Pr\left(E_{j+1} \,\Big|\, \bigcap_{i=1}^{j} E_i\right) \ge \frac{n - j - 2}{n - j}.$
We'll warm up just by considering $\Pr(E_1)$, the probability of avoiding the cut in the first iteration.
Letting the size of the cut be k, we have that every vertex must have degree at least k; otherwise, the edges incident on the smaller-degree vertex would be a smaller cut set. This then implies that $|E| \ge nk/2$, since every vertex has degree at least k and summing up the degrees over every vertex counts every edge exactly twice. Therefore, the probability of avoiding the cut set C is
$\Pr(E_1) = 1 - \frac{|C|}{|E|} \ge 1 - \frac{k}{nk/2} = \frac{n - 2}{n}.$
The same argument applies in later iterations: after j contractions there are $n - j$ super-vertices left, each of which must still have degree at least k, so the graph still has at least $(n - j)k/2$ edges. Hence
$\Pr\left(E_{j+1} \,\Big|\, \bigcap_{i=1}^{j} E_i\right) = 1 - \frac{|C|}{|E|} \ge 1 - \frac{k}{(n - j)k/2} = 1 - \frac{2}{n - j} = \frac{n - j - 2}{n - j},$
as claimed.
Substituting back into our equation, we see that we are down to a 1/3 probability in the last iteration, 2/4 in the iteration before that, etc. This product telescopes and leaves us with the bound of $\frac{2}{n(n-1)}$, as claimed.
Altogether then, this extremely simple procedure has given us a fairly efficient algorithm for finding a minimum cut set.
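Here is a rough sketch of the contraction algorithm in Python (my own code, using merge labels to stand in for contracted super-vertices; the graph at the end is a made-up example):

import random

def karger_min_cut(n, edges):
    """One run: returns a cut size; a min cut with probability >= 2/(n(n-1))."""
    label = list(range(n))                    # label[v] = current super-vertex

    def find(v):
        while label[v] != v:
            label[v] = label[label[v]]        # path compression
            v = label[v]
        return v

    remaining = n
    pool = list(edges)                        # edges between distinct super-vertices
    while remaining > 2 and pool:
        u, v = random.choice(pool)            # pick a random edge to contract
        label[find(u)] = find(v)              # merge the two endpoints
        remaining -= 1
        pool = [(a, b) for (a, b) in pool if find(a) != find(b)]  # drop self-loops
    # edges whose endpoints lie in different super-vertices form the cut
    return sum(1 for (a, b) in edges if find(a) != find(b))

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]      # min cut is 1 (the edge (2, 3))
print(min(karger_min_cut(4, edges) for _ in range(30)))   # almost surely 1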
Consider an assignment chosen uniformly at random. Define $Y_j$ to be 1 if the clause $c_j$ is satisfied and 0 otherwise. Since all the literals come from distinct variables, there are eight possible ways of assigning them true or false values, but only one of them will cause $Y_j$ to be equal to 0. The rest satisfy the clause and cause $Y_j$ to be 1. Thus, $E[Y_j] = 7/8$ for every j.
Now we consider the formula as a whole and let $Y = \sum_{j=1}^{m} Y_j$. Using the linearity of expectation, we have that
$\sum_{v \in Y(\Omega)} v \cdot \Pr(Y = v) = E[Y] = E\left[\sum_{j=1}^{m} Y_j\right] = \frac{7}{8} m.$
The key realization is that this value represents a kind of average. This means that not all of the v in the sum on the left can be less than the average on the right. There has to be a v where the probability is positive and $v \ge 7m/8$. Because this probability is positive, however, there must be some assignment to the variables that achieves it. Therefore, there is always a way to satisfy 7/8 of the clauses in any 3-SAT formula.
This technique of proof is called the expectation argument, and it is part of a larger collection of very powerful tools called the probabilistic method, which were developed and popularized by the famous Paul Erdős.
Let's say that we set variable $x_1$ to True. Then any clauses not using $x_1$ are left alone. If a clause has the literal $x_1$ in it, then it gets set to True. If a clause has the negated literal $\bar{x}_1$ in it, then we just eliminate that literal from the clause. If $\bar{x}_1$ is the only literal in a clause, then we just set the clause to False.
Another important subroutine will be one that calculates the expected number of clauses that will be satisfied if the remaining variables are assigned True or False uniformly at random.
Of course, if a clause is just True, it gets assigned a value of 1, and False gets assigned 0. A single remaining literal gets assigned 1/2, two literals get a value of 3/4, and three literals get a value of 7/8. Remember that there is just one way to assign the relevant variables so that the clause is false. The EofY procedure simply calculates these values for every clause and sums them up.
With these subroutines defined, we can write down our derandomized algorithm as follows.
Start with an empty assignment to the variables. Then, for each variable in turn, consider the formula resulting from its being set to True and from its being set to False. Between these two, pick the one that gives the larger expected value for the number of satisfied clauses, assuming that the remaining variables were set at random. Note that we are using our knowledge of how a random assignment would behave here, but we aren't actually using any randomization. Having picked the better of the two ways of assigning the variable $x_i$, we update the set of clauses and record our assignment to $x_i$.
The reason this algorithm works is that it maintains the invariant that the expected number of clauses of C that would be satisfied if the remaining variables were assigned at random is at least $7m/8$. This is true at the beginning, just by our previous theorem. But this expectation for C is always just the average of the expected numbers that would be satisfied in $C_p$ and $C_n$. So by picking the new C to be the one for which this EofY quantity is larger, the invariant is maintained. Of course, at the end, all of the variables have been assigned, so computing the expectation of C amounts to just counting up the number of True clauses. This technique is known as the method of conditional expectation, and it has a number of clever applications.
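Here is a sketch of the whole derandomized algorithm in Python (my own code; clauses are lists of signed variable indices, a representation I chose for illustration, with +i meaning x_i and -i meaning its negation):

def expected_satisfied(clauses, assignment):
    """E[# satisfied clauses] if the unset variables are set uniformly at random."""
    total = 0.0
    for clause in clauses:
        unset, satisfied = 0, False
        for lit in clause:
            var, want = abs(lit), lit > 0
            if var not in assignment:
                unset += 1
            elif assignment[var] == want:
                satisfied = True
        # an unsatisfied clause with k unset literals fails with probability 2^-k
        total += 1.0 if satisfied else 1.0 - 0.5 ** unset
    return total

def derandomized_max3sat(clauses, variables):
    assignment = {}
    for v in variables:
        assignment[v] = True
        e_true = expected_satisfied(clauses, assignment)
        assignment[v] = False
        e_false = expected_satisfied(clauses, assignment)
        assignment[v] = e_true >= e_false   # keep the larger conditional expectation
    return assignment

clauses = [[1, 2, 3], [-1, 2, -3], [1, -2, 3]]
print(derandomized_max3sat(clauses, [1, 2, 3]))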
Overall, then, we have shown that there is a deterministic algorithm which, given any 3-CNF formula, finds an assignment that satisfies 7/8 of the clauses. Remarkably, it turns out that this is the best we can do, assuming P is not equal to NP. For this argument, we turn to the celebrated and extremely popular PCP theorem.
then they become extremely efficient. These types of verifiers are called probabilistically checkable proof systems, and the
famous PCP theorem relates the set of languages they can verify under certain constraints back to the class NP.
In a course on complexity, we would place these proof systems within the larger context of other complexity classes and
interactive proof systems. For our purposes, however, the PCP theorem can be stated in a much more accessible way. We'll let $\Phi$ denote the set of all 3CNF formulas. Remember that we are assuming that all clauses have exactly 3 literals and that they come from 3 distinct variables. Then a version of the PCP theorem can be stated like this:
For any constant $\alpha > 7/8$, there is a polytime computable function f such that for every 3CNF formula $\varphi$:
1. if $\varphi$ is satisfiable, then $f(\varphi)$ has a satisfying assignment, and
2. if $\varphi$ is not satisfiable, then every assignment of the variables satisfies fewer than an $\alpha$ fraction of the clauses of $f(\varphi)$.
So if $\varphi$ is satisfiable, there is a way to satisfy all the clauses of $f(\varphi)$. If $\varphi$ is unsatisfiable, then you can't even get close to satisfying all the clauses of $f(\varphi)$. We've introduced a gap here, and this gap is extremely useful for proving hardness of approximation.
Many, many hardness of approximation results follow from this theorem. The most straightforward of them, however, is that you can't do better than the 7/8 algorithm for max 3-SAT that we just went over, not unless P=NP. Why?
Well, suppose that I wanted to test whether strings were in some language L in NP, and at my disposal I had a polytime $\alpha$-approximation for 3-SAT, where $\alpha$ is strictly greater than 7/8.
Then I could use the Cook-Levin reduction to transform my string x into an instance $\varphi$ of SAT that will be satisfiable if and only if x is in L.
Then I can use the f function from the PCP theorem to transform this into another set of 3SAT clauses where either all the clauses are satisfiable or fewer than an $\alpha$ fraction of them are.
That way, I just run the approximation on $f(\varphi)$ and see whether the fraction of clauses satisfied is at least $\alpha$ or not. If it is, then from the PCP theorem I can reason that $\varphi$ must have been satisfiable, and so from the Cook-Levin reduction x must have been in L.
On the other hand, if the fraction of satisfied clauses is less than $\alpha$, then $f(\varphi)$ cannot have been satisfiable, so $\varphi$ must not have been satisfiable, and so from the Cook-Levin reduction x must not be in L.
Using this reasoning, we just found a way to decide an arbitrary language in NP in polynomial time. So if such an approximation exists, then P=NP. Or, more likely, P is not equal to NP, so no such approximation algorithm can exist.
Many hardness of approximation proofs can be done in a similar way. All that's necessary is to stick in another transformation here, transforming the 3SAT problem that has this gap into another problem, which might have a potentially different gap, to show that certain approximation factors would imply P=NP.
Conclusion
With that, the study of algorithms has brought us back to complexity, and unfortunately, it also brings us to the end of the
course.
At the beginning, we said that by following the sort of rigorous arguments that we would make, you would be giving yourself a
kind of mental training that would be useful beyond the classroom. We certainly hope that you have found this to be true and
had some fun along the way.
You may not remember everything taught in the course, but here are a few points that I do hope will stay with you forever.
1. There are some things you can't compute at all, like the halting problem.
2. There are some things that we can't compute quickly, like traveling salesman and every other NP-complete problem.
3. There are some things that we can compute efficiently, and a little cleverness can go a long way, as with the Fast Fourier transform and maximum flows.
More important than the specific problems we tackled were the ways to think about computational problems. You will come across problems that you haven't seen before, and you can use the tools and techniques you've learned from this class to determine the best course of action and find the proper algorithm. And if the problem is NP-complete, you needn't just give up: as we have seen, approximation and randomization can often still get the job done.