Lecture 1 Notes
In this module we are going to look at designing algorithms. We will see how
they depend on the design of suitable data structures, and how some structures
and algorithms are more efficient than others for the same task. We’ll concentrate
on a few basic tasks, such as storing, sorting and searching data, that underlie much
of computer science, but the techniques discussed will be applicable much more
generally. We will start by studying some key data structures, such as arrays, lists,
queues, stacks and trees, and then move on to explore their use in a range of
different searching and sorting algorithms. This will lead us on to consider
approaches for the efficient storage of data in hash tables. Finally, we’ll look at
graph based representations and cover the kinds of algorithms needed to work
efficiently with them. Throughout, we’ll investigate the computational efficiency
of the algorithms we develop, and gain intuitions about the pros and cons of the
various potential approaches for each task. We will not restrict ourselves to
implementing the various data structures and algorithms in particular computer
programming languages (e.g., Java, C, OCaml), but specify them in simple
pseudocode that can easily be implemented in any appropriate language.
In this module we shall ignore such programming details, and concentrate on the
design of algorithms rather than programs. The task of implementing the discussed
algorithms as computer programs is left to the Software Workshop module, and
you will frequently see the same topics covered in both modules from different
perspectives. Having said that, you will often find it useful to write down segments
of actual programs in order to clarify and test certain algorithmic aspects. It is also
worth bearing in mind the distinction between different programming paradigms:
Imperative Programming describes computation in terms of instructions that
change the program/data state, whereas Declarative Programming specifies what
the program should accomplish without describing how to do it.
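As a small illustration (a sketch in C-like code, with n assumed to be declared elsewhere), consider summing the numbers 1 to n described in each style:

/* Imperative: say HOW, as a sequence of steps that change state */
int sum = 0;
for( int i = 1 ; i <= n ; i++ )
{
    sum += i;
}

/* Declarative: say WHAT, leaving the "how" unspecified,
   e.g. "sum is the value 1 + 2 + ... + n" */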
This module is primarily concerned with developing algorithms that map easily
onto the imperative programming approach. Algorithms can obviously be
described in plain English, and we will sometimes do that. However, for computer
scientists it is usually easier and clearer to use something that comes somewhere in
between formatted English and computer program code, but is not runnable
because certain details are omitted. This is called pseudocode. Often we will use
segments of pseudocode that are very similar to the languages we are interested in,
e.g. the overlap of C and Java, with the advantage that they can easily be inserted
into runnable programs.
For each algorithm we develop, three aspects will concern us: its specification, its verification, and its performance. The details of these three aspects will usually be rather problem dependent.
The specification should formalize the crucial details of the problem that the
algorithm is trying to solve. Sometimes that will be based on a particular
representation of the associated data, sometimes it will be presented more
abstractly. Typically, it will have to specify how the inputs and outputs of the
algorithm are related, though there is no general requirement that the specification
is complete or unambiguous.
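For example, a specification for the minimum-finding problem we shall meet at the end of this chapter might read: given an integer n ≥ 1 and an array a containing n numbers, return a value min such that min ≤ a[i] for every index i from 0 to n − 1, and min = a[j] for at least one such index j. Note that this says what the output must satisfy, not how it is to be computed.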
For simple problems, it is often easy to see that a particular algorithm will always
work, i.e. that it satisfies its specification. However, for more complicated
specifications and/or algorithms, the fact that an algorithm satisfies its specification
may not be obvious at all. In this case, we need to spend some effort verifying
whether the algorithm is indeed correct. In general, testing on a few particular
inputs can be enough to show that the algorithm is incorrect. However, since the
number of different potential inputs for most algorithms is infinite in theory, and
huge in practice, more than just testing on particular cases is needed to be sure that
the algorithm satisfies its specification. We need correctness proofs. Although we
will discuss proofs in this module, and useful relevant ideas like invariants, we will
usually only do so in a rather informal manner (though, of course, we will attempt
to be rigorous). The reason is that we want to concentrate on the data structures
and algorithms. Formal verification techniques are complex and will be taught in
later modules.
Finally, the efficiency or performance of an algorithm relates to the resources
required by it, such as how quickly it will run, or how much computer memory it
will use. This will usually depend on the problem instance size, the choice of data
representation, and the details of the algorithm. Indeed, this is what normally
drives the development of new data structures and algorithms. We shall study the
general ideas concerning efficiency in Chapter 5, and then apply them throughout
the remainder of the module.
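As a small foretaste (a sketch in C-like code, with n assumed to be declared elsewhere): the sum 1 + 2 + ... + n can be computed with a loop whose running time grows in proportion to n, or in a fixed number of operations using a well-known closed formula:

/* Version 1: one addition per iteration, so time proportional to n */
int sum1 = 0;
for( int i = 1 ; i <= n ; i++ )
{
    sum1 += i;
}

/* Version 2: the same result in constant time, since
   1 + 2 + ... + n = n(n+1)/2 */
int sum2 = n * (n + 1) / 2;

Both versions satisfy the same specification, but their efficiency differs considerably for large n.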
Data structures, abstract data types, design patterns
A data structure is a particular way of organizing a collection of data items, and an abstract data type specifies the operations that may be performed on it, independently of how they are implemented. At an even higher level of abstraction are design patterns, which describe the design of algorithms rather than the design of data structures. These embody and
generalize important design concepts that appear repeatedly in many problem
contexts. They provide a general structure for algorithms, leaving the details to be
added as required for particular problems. These can speed up the development of
algorithms by providing familiar proven algorithm structures that can be applied
straightforwardly to new problems. We shall see a number of familiar design
patterns throughout this module.
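One such pattern (named here just as an illustration) is divide and conquer, whose general structure can be sketched in pseudocode as:

solve( P ) {
    if ( P is small enough ) return the solution of P directly
    split P into subproblems P1, ..., Pk
    solve each Pi recursively
    combine the sub-solutions into a solution for P
}

The details of how to split, solve directly, and combine are then filled in for each particular problem.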
Overview
This module will cover the principal fundamental data structures and algorithms
used in computer science, and bring together a broad range of topics covered in
other modules into a coherent framework. Data structures will be formulated to
represent various types of information in such a way that it can be conveniently
and efficiently manipulated by the algorithms we develop. Throughout, the
recurring practical issues of algorithm specification, verification and performance
analysis will be discussed.
We shall begin by looking at some widely used basic data structures (namely
arrays, linked lists, stacks and queues), and the advantages and disadvantages of
the associated abstract data types. Then we consider the ubiquitous problem of
searching, and how that leads on to the general ideas of computational efficiency
and complexity. That will leave us with the necessary tools to study three
particularly important data structures: trees (in particular, binary search trees and
heap trees), hash tables, and graphs. We shall learn how to develop and analyse
increasingly efficient algorithms for manipulating and performing useful
operations on those structures, and look in detail at developing efficient processes
for data storing, sorting, searching and analysis. The idea is that once the basic
ideas and examples covered in this module are understood, dealing with more
complex problems in the future should be straightforward.
Arrays, Iteration, Invariants
Data is ultimately stored in computers as patterns of bits, though these days most
programming languages deal with higher level objects, such as characters, integers,
and floating point numbers. Generally, we need to build algorithms that manipulate
collections of such objects, so we need procedures for storing and sequentially
processing them.
Arrays
In computer science, the obvious way to store an ordered collection of items is as
an array. Array items are typically stored in a sequence of computer memory
locations, but to discuss them, we need a convenient way to write them down on
paper. We can just write the items in order, separated by commas and enclosed by
square brackets. Thus,
[1, 4, 17, 3, 90, 79, 4, 6, 81]
is an example of an array of integers. If we call this array a, we can write it as:
a = [1, 4, 17, 3, 90, 79, 4, 6, 81]
This array a has 9 items, and hence we say that its size is 9. In everyday life, we
usually start counting from 1. When we work with arrays in computer science,
however, we more often (though not always) start from 0. Thus, for our array a, its
positions are 0, 1, 2, ..., 7, 8. The element in position 8 is 81, and we use the
notation a[8] to denote this element. More generally, for any integer i denoting a
position, we write a[i] to denote the element in the i th position. This position i is
called an index (and the plural is indices). Then, in the above example, a[0] = 1,
a[1] = 4, a[2] = 17, and so on.
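In C or Java-like code (a sketch; the exact declaration syntax varies between languages), this array and its element access could be written as:

int a[] = { 1, 4, 17, 3, 90, 79, 4, 6, 81 };
// a[0] is 1, a[1] is 4, a[8] is 81
// a[9] would be an error: the valid indices run from 0 to 8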
It is worth noting at this point that the symbol = is quite overloaded. In
mathematics, it stands for equality. In most modern programming languages, =
denotes assignment, while equality is expressed by ==. We will typically use = in
its mathematical meaning, unless it is written as part of code or pseudocode.
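For example, in C or Java:

x = 5;            // assignment: the variable x now holds the value 5
if ( x == 5 )     // equality test: evaluates to true at this point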
We say that the individual items a[i] in the array a are accessed using their index i,
and one can move sequentially through the array by incrementing or decrementing
that index, or jump straight to a particular item given its index value. Algorithms
that process data stored as arrays will typically need to visit systematically all the
items in the array, and apply appropriate operations on them. This is often expressed in pseudocode as:
For i = 1,...,N,
   do something
In programming languages like C and Java this would be written as the for-loop
for( i = 0 ; i < N ; i++ )
{
// do something
}
in which a counter i keeps track of doing “the something” N times. For example,
we could compute the sum of all 20 items in an array a using
for( i = 0, sum = 0 ; i < 20 ; i++ ) {
sum += a[i];
}
We say that there is iteration over the index i. The general for-loop structure is
for( INITIALIZATION ; CONDITION ; UPDATE )
{
REPEATED PROCESS
}
in which any of the four parts are optional. One way to write this out explicitly is
INITIALIZATION
if ( not CONDITION ) go to LOOP FINISHED
LOOP START
REPEATED PROCESS
UPDATE
if ( CONDITION ) go to LOOP START
LOOP FINISHED
In this module, we will regularly make use of this basic loop structure when
operating on data stored in arrays, but it is important to remember that different
programming languages use different syntax, and there are numerous variations
that check the condition to terminate the repetition at different points.
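For example, C and Java also provide a do-while loop, which checks the condition after the repeated process rather than before, so the body is always executed at least once:

INITIALIZATION
do {
    REPEATED PROCESS
    UPDATE
} while ( CONDITION );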
Invariants
An invariant, as the name suggests, is a condition that does not change during
execution of a given program or algorithm. It may be a simple inequality, such as
“i < 20”, or something more abstract, such as “the items in the array are sorted”.
Invariants are important for data structures and algorithms because they enable
correctness proofs and verification.
In particular, a loop-invariant is a condition that is true at the beginning and end of
every iteration of the given loop. Consider the standard simple example of a
procedure that finds the minimum of n numbers stored in an array a:
float minimum(int n, float a[n]) {
   float min = a[0];
   // min equals the minimum item in a[0],...,a[0]
   for(int i = 1 ; i != n ; i++) {
      // min equals the minimum item in a[0],...,a[i-1]
      if (a[i] < min) min = a[i];
   }
   // min equals the minimum item in a[0],...,a[i-1], and i==n
   return min;
}
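For example, applied to the array from earlier (assuming the C-style signature above):

float a[] = { 1, 4, 17, 3, 90, 79, 4, 6, 81 };
float min = minimum(9, a);   // min is now 1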
At the beginning of each iteration (and hence at the end of the one before it), the invariant
“min equals the minimum item in a[0], ..., a[i − 1]” is true: it starts off true, and
the repeated process and update clearly maintain its truth. Hence, when the loop
terminates with “i == n”, we know that “min equals the minimum item in a[0], ...,
a[n − 1]” and hence we can be sure that min can be returned as the required
minimum value. This is a kind of proof by induction: the invariant is true at the
start of the loop, and is preserved by each iteration of the loop, therefore it must be
true at the end of the loop.
As we noted earlier, formal proofs of correctness are beyond the scope of this
module, but identifying suitable loop invariants and their implications for
algorithm correctness as we go through this module will certainly be a useful
exercise. We will also see how invariants (sometimes called inductive assertions)
can be used to formulate similar correctness proofs concerning properties of data
structures that are defined inductively.