06 Binsearch
Binary Search
1 Introduction
One of the fundamental and recurring problems in computer science is to
find elements in collections, such as elements in sets. An important
algorithm for this problem is binary search. We use binary search for an
integer in a sorted array to exemplify it. We started in the last lecture by
discussing linear search and giving some background on the problem. This
lecture clearly illustrates the power of order in algorithm design: if an array
is sorted we can search through it very efficiently, much more efficiently
than when it is not ordered.
We will also once again see the importance of loop invariants in writing
correct code. Here is a note by Jon Bentley about binary search:
I’ve assigned [binary search] in courses at Bell Labs and IBM. Professional
programmers had a couple of hours to convert [its] description
into a program in the language of their choice; a high-level pseudocode
was fine. At the end of the specified time, almost all the programmers
reported that they had correct code for the task. We would then take
thirty minutes to examine their code, which the programmers did with
test cases. In several classes and with over a hundred programmers,
the results varied little: ninety percent of the programmers found bugs
in their programs (and I wasn’t always convinced of the correctness of
the code in which no bugs were found).
I was amazed: given ample time, only about ten percent of professional
programmers were able to get this small program right. But
they aren’t the only ones to find this task difficult: in the history in
Section 6.2.1 of his Sorting and Searching, Knuth points out that
L ECTURE N OTES
Binary Search L6.2
while the first binary search was published in 1946, the first published
binary search without bugs did not appear until 1962.
—Jon Bentley, Programming Pearls (1st edition), pp.35–36
2 Binary Search
Can we do better than searching through the array linearly? If you don’t
know the answer already it might be surprising that, yes, we can do
significantly better! Perhaps almost equally surprising is that the code is
almost as short!
Before we write the code, let us describe the algorithm. We start searching
for x by examining the middle element of the sorted array. If it is smaller
than x, then x must be in the upper half of the array (if it is there at all); if
it is greater than x, then x must be in the lower half. Now we continue by
restricting our attention to either the upper or lower half, again finding the
middle element and proceeding as before.
We stop if we either find x, or if the size of the subarray shrinks to zero,
in which case x cannot be in the array.
Before we write a program to implement this algorithm, let us analyze
the running time. Assume for the moment that the size of the array is a
power of 2, say $2^k$. Each time around the loop, when we examine the
middle element, we cut the size of the subarray we look at in half. So before
the first iteration the size of the subarray of interest is $2^k$. After the first
iteration it is of size $2^{k-1}$, then $2^{k-2}$, etc. After $k$ iterations it will be
$2^{k-k} = 1$, so we stop after the next iteration. Altogether we can have at
most $k + 1$ iterations. Within each iteration, we perform a constant amount
of work: computing the midpoint, and a few comparisons. So, overall, when
given an array of size $n$ we perform $c \cdot \log_2(n)$ operations (for some constant $c$).[1]
[1] In general in computer science, we are mostly interested in logarithm to the base 2 so
we will just write log(n) for log to the base 2 from now on unless we are considering a
different base.
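The halving argument can be checked mechanically. The following is a minimal sketch in C (not the lecture's C0) that models only the worst case of the analysis: an interval of size n is replaced by one of size n/2 until it is empty. The function name `max_iterations` is hypothetical and does not come from the lecture code.

```c
#include <assert.h>

/* Worst-case number of loop iterations for an interval of n elements,
   assuming each iteration keeps at most half of the interval
   (rounded down). Hypothetical helper for illustration only. */
int max_iterations(int n) {
    assert(n >= 0);
    int iterations = 0;
    while (n > 0) {
        n = n / 2;      /* the surviving subarray is at most half as large */
        iterations++;
    }
    return iterations;
}
```

For example, `max_iterations(8)` returns 4, which matches the bound k + 1 for k = 3.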
In the body of the loop, we first compute the midpoint mid. By elementary
arithmetic it is indeed between lo and hi.
Next in the loop body we check if A[mid] = x. If so, we have found the
element and return mid.
Now comes the hard part. What is the missing part of the invariant?
The first instinct might be to say that x should be in the interval from A[lo]
to A[hi ]. But that may not even be true when the loop is entered the first
time.
Let’s consider a generic situation in the form of a picture and collect
some ideas about what might be appropriate loop invariants. Drawing
diagrams to reason about an algorithm and the code that we are trying
to construct is an extremely helpful general technique.
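index    0    1    2    3    4    5    6    7    8    9   10
A        5    7 [ 11   19   34   42 ] 65   65   89  123
         0        lo                  hi                   n
(Figure: sorted array A with n = 10, lo = 2, hi = 6; the brackets mark the boxed segment.)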
The red box around elements 2 through 5 marks the segment of the array
still under consideration. This means we have ruled out everything to
the right of (and including) hi and to the left of (and not including) lo.
Everything to the left is ruled out because those values are known to be
strictly less than x; everything to the right is ruled out because those values
are known to be strictly greater than x. The middle is still unknown.
We can depict this as follows:
         < x              ?                  > x
index    0    1    2    3    4    5    6    7    8    9   10
A        5    7   11   19   34   42   65   65   89  123
         0        lo                  hi                   n
(Figure: A[0..lo) is known to be < x, A[lo..hi) is unknown, A[hi..n) is known to be > x.)
We can summarize this by stating that A[lo − 1] < x and A[hi ] > x. This
implies that x cannot be in the segments A[0..lo) and A[hi ..n) because the
array is sorted (so all array elements to the left of A[lo − 1] will also be less
than x and all array elements to the right of A[hi ] will also be greater than
x). For an alternative, see Exercise 2.
We can postulate these as invariants in the code.
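To make the picture concrete, here is a small sketch in C (not the lecture's C0) that checks by brute force what the two facts are meant to guarantee: that every element of A[0..lo) is less than x and every element of A[hi..n) is greater than x. The helper name `outside_ruled_out` is hypothetical and is not part of the lecture code.

```c
#include <assert.h>

/* Returns 1 if all of A[0..lo) is strictly less than x and all of
   A[hi..n) is strictly greater than x, and 0 otherwise.
   Hypothetical helper; it scans linearly and is meant only as a check. */
int outside_ruled_out(int x, const int *A, int lo, int hi, int n) {
    assert(0 <= lo && lo <= hi && hi <= n);
    for (int i = 0; i < lo; i++) {
        if (!(A[i] < x)) return 0;   /* left segment must be < x */
    }
    for (int i = hi; i < n; i++) {
        if (!(A[i] > x)) return 0;   /* right segment must be > x */
    }
    return 1;
}
```

On the array from the diagram with lo = 2 and hi = 6, the check succeeds for x = 19 (since A[1] = 7 < 19 and A[6] = 65 > 19) and fails for x = 7, because A[1] = 7 is not strictly less than x.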
In the code we blithely wrote A[lo − 1] and A[hi ] because they were
in the middle of the array in our diagram. But initially (and potentially
through many iterations) this may not be the case. Fortunately, it is easy to
fix, following what we did for linear search. Consider the following picture
when we start the search.
index    0    1    2    3    4    5    6    7    8    9   10
A        5    7   11   19   34   42   65   65   89  123
         0 = lo                                       hi = n
(Figure: at the start of the search, lo = 0 and hi = n = 10.)
At this point, let’s check if the loop invariant is strong enough to imply
the postcondition of the function. If we return from inside the loop because
A[mid] = x we return mid, so A[\result] == x as required.
If we exit the loop because lo < hi is false, then lo ≥ hi; together with
lo ≤ hi from the first loop invariant, this gives lo = hi. Now we have to
distinguish some cases.
1. If A[lo − 1] < x and A[hi ] > x, then A[lo] > x (since lo = hi ). Because
the array is sorted, x cannot be in it.
Notice that we could verify all this without even knowing the complete
program! As long as we can finish the loop to preserve the invariant and
terminate, we will have a correct implementation! This would again be a
good point for you to interrupt your reading and to try to complete the
loop, reasoning from the invariant.
We have already tested if A[mid] = x. If not, then A[mid] must be less or
greater than x. If it is less, then we can keep the upper end of the interval as
is, and set the lower end to mid + 1. Now A[lo − 1] < x (because A[mid] < x
and lo = mid + 1), and the condition on the upper end remains unchanged.
If A[mid] > x we can set hi to mid and keep lo the same. We do not
need to test this last condition, because the fact that the tests A[mid] = x
and A[mid] < x both failed implies that A[mid] > x. We note this in an
assertion.
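Putting the pieces together, the loop described above might look as follows. This is a sketch in C, not the lecture's C0 listing: the loop invariants and the assertion are approximated with run-time `assert` calls that fire on each entry to the loop body, and the function name `binsearch` and its signature are assumptions. The disjuncts `lo == 0` and `hi == n` keep the array accesses in bounds at the ends of the array.

```c
#include <assert.h>

/* Returns an index i with A[i] == x, or -1 if x does not occur in A[0..n).
   Assumes n >= 0 and that A[0..n) is sorted in ascending order.
   Sketch only; name and signature are illustrative. */
int binsearch(int x, const int *A, int n) {
    int lo = 0;
    int hi = n;
    while (lo < hi) {
        /* Loop invariants, checked on each entry to the loop body. */
        assert(0 <= lo && lo <= hi && hi <= n);
        assert(lo == 0 || A[lo - 1] < x);
        assert(hi == n || A[hi] > x);

        int mid = lo + (hi - lo) / 2;
        assert(lo <= mid && mid < hi);

        if (A[mid] == x) return mid;
        if (A[mid] < x) {
            lo = mid + 1;            /* now A[lo - 1] < x */
        } else {
            assert(A[mid] > x);      /* both other tests failed */
            hi = mid;                /* now A[hi] > x */
        }
    }
    return -1;                       /* lo == hi: x cannot be in the array */
}
```

On the array from the diagrams, searching for 34 returns index 4 and searching for 6 returns -1. Searching for 65 returns one of the two indices holding 65, not necessarily the leftmost.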
Init: When the loop is first reached, we have lo = 0 and hi = n, so the first
loop invariant follows from the precondition to the function. Furthermore,
the first disjuncts of loop invariants two (lo == 0) and three
(hi == n) are satisfied.
Preservation: Assume the loop invariants are satisfied and we enter the
loop:
0 ≤ lo ≤ hi ≤ n (Inv 1)
(lo = 0 or A[lo − 1] < x) (Inv 2)
(hi = n or A[hi ] > x) (Inv 3)
lo < hi (loop condition)
We compute $mid = lo + \lfloor (hi - lo)/2 \rfloor$. Now we distinguish three cases:
4 Termination
Does this function terminate? If the loop body executes, that is, lo < hi ,
then the interval from lo to hi is non-empty. Moreover, the intervals from
lo to mid and from mid + 1 to hi are both strictly smaller than the original
interval. Unless we find the element, the difference between hi and lo must
eventually become 0 and we exit the loop.
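The termination argument rests on the claim that both candidate intervals are strictly smaller than the current one. As a minimal sketch in C (not C0), the hypothetical helper `shrinks` below computes the sizes of the two possible successor intervals for a given non-empty interval and reports whether both are non-negative and strictly smaller than hi − lo.

```c
#include <assert.h>

/* For a non-empty interval [lo, hi), returns 1 if both possible
   successor intervals, [lo, mid) and [mid + 1, hi), are strictly
   smaller than the original and have non-negative size.
   Hypothetical helper for illustration only. */
int shrinks(int lo, int hi) {
    assert(0 <= lo && lo < hi);
    int mid = lo + (hi - lo) / 2;
    int size = hi - lo;
    int lower_size = mid - lo;          /* size of [lo, mid) */
    int upper_size = hi - (mid + 1);    /* size of [mid + 1, hi) */
    return 0 <= lower_size && lower_size < size
        && 0 <= upper_size && upper_size < size;
}
```

Since hi − lo is a non-negative integer that strictly decreases on every iteration that does not return, it can serve as the measure that bounds the number of iterations.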
but that is in fact incorrect. Consider this change and try to find out why
this would introduce a bug.
Were you able to see it? It’s subtle, but somewhat related to other problems
we had. When we compute (lo + hi)/2, we could actually have an
overflow, if lo + hi > $2^{31} - 1$. This is somewhat unlikely in practice, since
$2^{31}$ = 2G, about 2 billion, so the array would have to have at least 1 billion
elements. This is not impossible, and, in fact, a bug like this in the Java
libraries[2] was actually exposed.
Fortunately, the fix is simple: because lo < hi , we know that hi − lo > 0
and represents the size of the interval. So we can divide that in half and
add it to the lower end of the interval to get its midpoint.
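The difference between the two midpoint computations can be illustrated with a short sketch in C (not C0). It assumes 32-bit two's-complement integers whose arithmetic wraps around on overflow. Because signed overflow is undefined behavior in C, the wraparound is emulated with 64-bit arithmetic. The names `safe_mid` and `wrapped_mid` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Midpoint computed as lo + (hi - lo)/2. Since 0 <= lo < hi, the
   difference hi - lo is positive and the result never exceeds hi,
   so no intermediate value overflows. */
int32_t safe_mid(int32_t lo, int32_t hi) {
    assert(0 <= lo && lo < hi);
    int32_t mid = lo + (hi - lo) / 2;
    assert(lo <= mid && mid < hi);
    return mid;
}

/* The value (lo + hi)/2 would have under 32-bit wraparound arithmetic
   (an assumption of this sketch), emulated in 64 bits. */
int32_t wrapped_mid(int32_t lo, int32_t hi) {
    assert(0 <= lo && lo < hi);
    int64_t sum = (int64_t)lo + (int64_t)hi;
    if (sum > INT32_MAX) {
        sum -= ((int64_t)1 << 32);   /* emulate wraparound past 2^31 - 1 */
    }
    return (int32_t)(sum / 2);       /* division truncates toward zero */
}
```

With lo = 2^30 and hi = 2^30 + 2, `safe_mid` returns 2^30 + 1, whereas `wrapped_mid` returns a negative number, which would be an invalid array index. For small values such as lo = 2 and hi = 6 the two functions agree.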
Let us convince ourselves why the assert is correct. The division by two
rounds toward zero, which here means rounding down, because hi − lo > 0.
Thus, 0 ≤ (hi − lo)/2 < hi − lo, because dividing a positive number by two
will make it strictly smaller. Hence, adding lo throughout,
lo ≤ lo + (hi − lo)/2 < lo + (hi − lo) = hi, that is, lo ≤ mid < hi.
6 Some Measurements
Algorithm design is an interesting mix between mathematics and an
experimental science. Our analysis above, albeit somewhat preliminary in
nature, allows us to make some predictions of running times of our
implementations. We start with linear search. We first set up a file to do some
experiments. We assume we have already tested our functions for
correctness, so only timing is at stake. See the file search-time.c0 in the code
directory for this lecture. We compile this file, together with our
implementation from this lecture, with the cc0 command below. We can get an
overall end-to-end timing with the Unix time command. Note that we do
not use the -d flag, since that would dynamically check contracts and
completely throw off our timings.
When running linear search 2000 times (1000 times with x in the array, and
1000 times with random x) on $2^{18}$ elements (256 K elements) we get the
following answer:
Timing 1000 times with 2^18 elements
0
4.602u 0.015s 0:04.63 99.5% 0+0k 0+0io 0pf+0w
which indicates 4.602 seconds of user time.
Running linear search 2000 times on random arrays of size $2^{18}$, $2^{19}$ and
$2^{20}$ we get the timings on our MacBook Pro
Exercises
Exercise 1 Rewrite the binary search function so that both lower and upper bounds
of the interval are inclusive. Make sure to rewrite the loop invariants and the loop
body appropriately, and prove the correctness of the new loop invariants. Also
explicitly prove termination by giving a measure that strictly decreases each time
around the loop and is bounded from below.
Exercise 2 Rewrite the invariants of the binary search function to use is_in(x, A, l, u),
which returns true if and only if there is an i such that x = A[i] for l ≤ i < u.
is_in assumes that 0 ≤ l ≤ u ≤ n where n is the length of the array.
Then prove the new loop invariants, and verify that they are strong enough to
imply the function’s postcondition.
Exercise 3 Binary search as presented here may not find the leftmost occurrence
of x in the array in case the occurrences are not unique. Give an example
demonstrating this.
Now change the binary search function and its loop invariants so that it will
always find the leftmost occurrence of x in the given array (if it is actually in the
array, −1 as before if it is not).
Prove the loop invariants and the postconditions for this new version, and
verify termination.
Exercise 4 If you were to replace the midpoint computation by
int mid = (lo + hi)/2;
then which part of the contract will alert you to a flaw in your thinking? Why?
Give an example showing how the contracts can fail in that case.
Exercise 5 In lecture, we used design-by-invariant to construct the loop body
implementation from the loop invariant that we have identified before. We could also
have maintained the loop invariant by replacing the whole loop body just with
// .... loop_invariant elided ....
{
lo = lo;
hi = hi;
}
Prove the loop invariants for this loop body. What is wrong with this choice?
Which part of our proofs fails, thereby indicating why this loop body would not
implement binary search correctly?