CS315 Class Notes
Raphael Finkel
May 4, 2021

1 Intro

Class 1, 1/26/2021

• Handout 1 — My names
• TA:
• Plagiarism — read aloud

3 Tools

• Use
• Specification
• Implementation

4 Singly-linked list

• Used as a part of several ADTs.
• Can be considered an ADT itself.
• Collection of nodes, each with optional arbitrary data and a pointer to the next element on the list.
• [Figure: a handle points to a chain of nodes holding a, c, x, f; each node has a data field and a pointer field, and the last node holds a null pointer.]
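A minimal sketch of the node type drawn above, in the style of the later code in these notes; insertAfter is a hypothetical helper (not from the notes) that answers the first exercise below:

    #include <stdlib.h>

    typedef struct node_s {
        int data;
        struct node_s *next; // the last node holds a null pointer
    } node;

    // hypothetical helper: splice a new node after a given one
    node *insertAfter(node *given, int data) {
        node *answer = (node *) malloc(sizeof(node));
        answer->data = data;
        answer->next = given->next;
        given->next = answer;
        return answer;
    } // insertAfter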
• Exercise: Is it easy to add a new node after a given node?
• Exercise: Is it easy to add a new node before a given node?

6 Improving the efficiency of some operations

• To make count() fast: maintain the count in a separate variable. If we need the count more often than we insert and delete, it is worthwhile.
• To make insertion at the rear fast: maintain two handles, one to the front, the other to the rear of the list.
• Combine these new items in a header node:

    typedef struct {
        node *front;
        node *rear;
        int count;
    } nodeHeader;

• Class 2, 1/28/2021
• To make search faster: remove the special case that we reach the end of the list by placing a pseudo-data node at the end. Keep track of the pseudo-data node in the header.

    typedef struct {
        node *front;
        node *rear;
        node *pseudo;
        int count;
    } nodeHeader;

    node *searchDataIterative(nodeHeader *header, int data) {
        // iterative method
        header->pseudo->data = data; // the search is now sure to stop
        node *current = header->front;
        while (current->data != data) {
            current = current->next;
        }
        return (current == header->pseudo ? NULL : current);
    } // searchDataIterative

• Exercise: If we want both pseudo-data and a rear pointer, how does an empty list look?
• Exercise: If we want pseudo-data, how does searchDataRecursive() change?

7 Aside: Unix pipes

• Unix programs automatically have three “files” open: standard input, which is by default the keyboard; standard output, which is by default the screen; and standard error, which is also by default the screen.
• In C and C++, they are defined in stdio.h by the names stdin, stdout, and stderr.
• The command interpreter (in Unix, it’s called the “shell”) lets you invoke programs redirecting any or all of these three. For instance, ls | wc redirects stdout of the ls program to stdin of the wc program.
• If you run your trains program without redirection, you can type in arbitrary numbers.
• If you run randGen.pl without redirection, it generates an unbounded list of pseudo-random numbers to stdout.
• If you run randGen.pl | trains, the list of numbers from randGen.pl is redirected as input to trains.

8 Stacks, queues, deques: built out of either linked lists or arrays

• We’ll see each of these.

9 Stack of integer

• Abstract definition: either empty or the result of pushing an integer onto the stack.
• operations
  • stack makeEmptyStack()
  • boolean isEmptyStack(stack S)
  • int popStack(stack *S) // modifies S
  • void pushStack(stack *S, int I) // modifies S

10 Implementation 1 of Stack: Linked list

11 Implementation 2 of Stack: Array

• Class 3, 2/2/2021
• Warning: it’s easy to make off-by-one errors.

    #define MAXSTACKSIZE 10
    #include <stdlib.h>

    stack *makeEmptyStack() {
        stack *answer = (stack *) malloc(sizeof(stack));
        answer->count = 0;
        return answer;
    } // makeEmptyStack

• The array implementation limits the size. Does the linked-list implementation also limit the size?
• The array implementation needs one cell per (potential) element, and one for the count. How much space does the linked-list implementation need?
• We can position two opposite-sense stacks in one array so long as their combined contents fit: one grows from each end toward the middle.
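The notes show only makeEmptyStack for the array implementation; here is a hedged sketch of the remaining operations, assuming the struct layout below (the layout itself does not appear on these pages):

    #include <stdbool.h>

    typedef struct {
        int data[MAXSTACKSIZE];
        int count; // number of elements currently on the stack
    } stack;

    bool isEmptyStack(stack *S) {
        return (S->count == 0);
    } // isEmptyStack

    void pushStack(stack *S, int I) { // modifies S
        // should check for overflow: count == MAXSTACKSIZE
        S->data[S->count] = I;
        S->count += 1;
    } // pushStack

    int popStack(stack *S) { // modifies S
        // should check for underflow: count == 0
        S->count -= 1;
        return S->data[S->count];
    } // popStack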
    typedef struct node_s {
        int data;
        struct node_s *next;
    } node;

• Abstract definition: either empty or the result of inserting an integer at the rear of a queue or deleting an integer from the front of a queue.
• Representation 3: Array
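The code for the array representation of a queue is not preserved on these pages; the following is a sketch of one common choice, a circular buffer (all names are assumptions):

    #define MAXQUEUESIZE 10

    typedef struct {
        int data[MAXQUEUESIZE];
        int front; // index of the oldest element
        int count; // number of elements
    } queue;

    void insertInQueue(queue *Q, int value) { // at the rear
        // should check for overflow: count == MAXQUEUESIZE
        Q->data[(Q->front + Q->count) % MAXQUEUESIZE] = value;
        Q->count += 1;
    } // insertInQueue

    int deleteFromQueue(queue *Q) { // from the front
        // should check for underflow: count == 0
        int answer = Q->data[Q->front];
        Q->front = (Q->front + 1) % MAXQUEUESIZE;
        Q->count -= 1;
        return answer;
    } // deleteFromQueue

The indices wrap around, so the queue keeps accepting insertions as elements are deleted, without ever shifting data.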
    // warning: it's easy to make off-by-one errors.
    bool binarySearch(int target, int *array,
            int lowIndex, int highIndex) {
        // look for target in array[lowIndex..highIndex]; assumes nonempty range
        while (lowIndex < highIndex) { // at least 2 elements
            int mid = (lowIndex + highIndex) / 2; // round down
            if (array[mid] < target) lowIndex = mid + 1;
            else highIndex = mid;
        } // while at least 2 elements
        return (array[lowIndex] == target);
    } // binarySearch

17 Quadratic search: set mid based on discrepancy

Also called interpolation search, extrapolation search, dictionary search.

    bool quadraticSearch(int target, int *array,
            int lowIndex, int highIndex) {
        // look for target in array[lowIndex..highIndex]
        while (lowIndex < highIndex) { // at least 2 elements
            if (array[highIndex] == array[lowIndex]) {
                highIndex = lowIndex;
                break;
            }
            float percent = (0.0 + target - array[lowIndex])
                / (array[highIndex] - array[lowIndex]);
            int mid = (int)(percent * (highIndex-lowIndex)) + lowIndex;
            if (mid == highIndex) {
                mid -= 1;
            }
            if (array[mid] < target) {
                lowIndex = mid + 1;
            } else {
                highIndex = mid;
            }
        } // while at least 2 elements
        return (array[lowIndex] == target);
    } // quadraticSearch
Experimental results
• Bad news: any comparison-based searching algorithm is Ω(log n), that is, needs at least on the order of log n steps.
• Notation, slightly more formally defined. All these ignore multiplicative constants.
  • O(f(n)): no worse than f(n); at most f(n).
  • Ω(f(n)): no better than f(n); at least f(n).
  • Θ(f(n)): no better or worse than f(n); exactly f(n).

    treeNode *searchTree(treeNode *tree, int key) {
        if (tree == NULL) return(NULL);
        else if (tree->data == key) return(tree);
        else if (key <= tree->data)
            return(searchTree(tree->left, key));
        else
            return(searchTree(tree->right, key));
    } // searchTree
• insert(data) and search(data) are O(log n), but we can generally treat them as O(1).
• We will discuss hashing later.

20 Traversals

• A traversal walks through the tree, visiting every node.
• Symmetric traversal (also called inorder):

    void symmetric(treeNode *tree) {
        if (tree == NULL) return;
        symmetric(tree->left);
        visit(tree); // e.g., print tree->data
        symmetric(tree->right);
    } // symmetric
• We can also compute the cost using the recursion theorem (page 17):
  • c_n = n + c_{n/2} (if we are lucky)
  • c_n = n + c_{2n/3} (fairly average case)
  • f(n) = n = O(n^1)
  • k = 1, a = 1, b = 2 (or b = 3/2)
  • a < b^k
  • so c_n = Θ(n^k) = Θ(n)

23 Partitioning an array

• Nico Lomuto’s method
• Online demonstration.
• The method partitions array[lowIndex .. highIndex] into three pieces:
  • array[lowIndex .. divideIndex−1]
  • array[divideIndex]
  • array[divideIndex+1 .. highIndex]
  The elements of each piece are in order with respect to adjacent pieces.

    int partition(int array[], int lowIndex, int highIndex) {
        // modifies array, returns pivot index.
        int pivotValue = array[lowIndex];
        int divideIndex = lowIndex;
        for (int combIndex = lowIndex+1; combIndex <= highIndex;
                combIndex += 1) {
            // array[lowIndex] is the partitioning (pivot) value.
            // array[lowIndex+1 .. divideIndex] are < pivot
            // array[divideIndex+1 .. combIndex-1] are >= pivot
            // array[combIndex .. highIndex] are unseen
            if (array[combIndex] < pivotValue) { // see a small value
                divideIndex += 1;
                swap(array, divideIndex, combIndex);
            }
        } // each combIndex
        // swap pivotValue into its place
        swap(array, divideIndex, lowIndex);
        return(divideIndex);
    } // partition

• Example
• [Figure: selection sort repeatedly finds the smallest element of the unsorted region.]
• n iterations.
• Iteration i may need to search through n − i places.
• ⇒ O(n²).
• Experimental results for Selection Sort: compares + moves ≈ n²/2.

    n    compares  moves  n²/2
    100  4950      198    5000
    200  19900     398    20000
    400  79800     798    80000

    void selectionSort(int array[], int length) {
        // array goes from 0..length-1
        int combIndex, smallestValue, bestIndex, probeIndex;
        for (combIndex = 0; combIndex < length; combIndex += 1) {
            // array[0 .. combIndex-1] has lowest elements, sorted.
            // Find smallest other element to place at combIndex.
            smallestValue = array[combIndex];
            bestIndex = combIndex;
            for (probeIndex = combIndex+1; probeIndex < length;
                    probeIndex += 1) {
                if (array[probeIndex] < smallestValue) {
                    smallestValue = array[probeIndex];
                    bestIndex = probeIndex;
                }
            }
            swap(array, combIndex, bestIndex);
        } // for combIndex
    } // selectionSort

• Not stable, because the swap moves an arbitrary value into the unsorted area.

29 Quicksort (C. A. R. Hoare)

• Class 8, 2/18/2021
• Recursive based on partitioning:
• [Figure: a random array is partitioned into a small piece and a big piece around the pivot; each piece is then sorted recursively.]
• about log n depth.
• each depth takes about O(n) work.
• ⇒ O(n log n).
• Can be unlucky: O(n²).
• To prevent worst-case behavior, partition based on median of 3 or 5.
• Don’t Quicksort small regions; use a final insertionSort pass instead. Experiments show that the optimal break point depends on the implementation, but somewhere between 10 and 100 is usually good.
• Experimental results for QuickSort: compares + moves ≈ 2.4 n log n.

    n     compares  moves  n log n  n²/2
    100   643       824    664      5000
    200   1444      1668   1528     20000
    400   3885      4228   3457     80000
    800   8066      8966   7715     320000
    1600  17583     18958  17030    1280000
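The recursive driver that uses partition() is not shown on these pages; a minimal sketch (the name quickSort is an assumption):

    void quickSort(int array[], int lowIndex, int highIndex) {
        if (highIndex - lowIndex < 1) return; // width 0 or 1
        int pivotIndex = partition(array, lowIndex, highIndex);
        quickSort(array, lowIndex, pivotIndex - 1); // sort the small piece
        quickSort(array, pivotIndex + 1, highIndex); // sort the big piece
    } // quickSort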
    void siftUp(int heap[], int subjectIndex) {
        // the element in subjectIndex needs to be sifted up.
        heap[0] = heap[subjectIndex]; // pseudo-data
        while (1) { // compare with parentValue.
            int parentIndex = subjectIndex / 2;
            if (heap[parentIndex] <= heap[subjectIndex]) return;
            swap(heap, parentIndex, subjectIndex); // parent was bigger: move up
            subjectIndex = parentIndex;
        }
    } // siftUp

    void insertInHeap(int heap[], int *heapSize, int value) {
        *heapSize += 1; // should check for overflow
        heap[*heapSize] = value;
        siftUp(heap, *heapSize);
    } // insertInHeap
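The notes show siftUp and insertInHeap; deletion needs a matching siftDown. This sketch assumes the same 1-based, top-light (smallest on top) layout, with the children of index i at 2i and 2i+1; the names are assumptions:

    void siftDown(int heap[], int heapSize, int subjectIndex) {
        while (1) {
            int childIndex = 2 * subjectIndex;
            if (childIndex > heapSize) return; // no children
            if (childIndex + 1 <= heapSize &&
                    heap[childIndex+1] < heap[childIndex]) {
                childIndex += 1; // take the smaller child
            }
            if (heap[subjectIndex] <= heap[childIndex]) return;
            swap(heap, subjectIndex, childIndex);
            subjectIndex = childIndex;
        }
    } // siftDown

    int deleteFromHeap(int heap[], int *heapSize) {
        int answer = heap[1]; // the top element
        heap[1] = heap[*heapSize];
        *heapSize -= 1;
        siftDown(heap, *heapSize, 1);
        return answer;
    } // deleteFromHeap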
• That formula approaches n as j → ∞.
• Total complexity is therefore O(n + n log n) = O(n log n).
• This sorting method is not stable, because sifting does not preserve order.
• Experimental results for Heap Sort: compares + moves ≈ 3.1 n log n.

    n     compares  moves  n log n  n²/2
    100   755       1190   664      5000
    200   1799      2756   1528     20000
    400   4180      6196   3457     80000
    800   9621      14050  7715     320000
    1600  21569     31214  17030    1280000

32 Bin sort

• Assumptions: values lie in a small range; there are no duplicates.
• Storage: build an array of bins, one for each possible value. Each is 1 bit long.

• Pass 2: examine values in bin order, and in list order within bins, placing them in a new copy of bins based on the second-to-last digit.
• Pass 3, 4: similar.
• The number of digits is O(log n), so there are O(log n) passes, each of which takes O(n) time, so the algorithm is O(n log n).
• This sorting method is stable.

34 Merge sort

    void mergeSort(int array[], int lowIndex, int highIndex) {
        // sort array[lowIndex] .. array[highIndex]
        if (highIndex - lowIndex < 1) return; // width 0 or 1
        int mid = (lowIndex+highIndex)/2;
        mergeSort(array, lowIndex, mid);
        mergeSort(array, mid+1, highIndex);
        merge(array, lowIndex, highIndex);
    } // mergeSort
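mergeSort calls merge(), which does not appear on these pages; a sketch using a temporary array (the temporary-array strategy is an assumption):

    void merge(int array[], int lowIndex, int highIndex) {
        // merge sorted array[lowIndex..mid] with sorted array[mid+1..highIndex]
        int mid = (lowIndex + highIndex) / 2;
        int width = highIndex - lowIndex + 1;
        int temp[width];
        int left = lowIndex, right = mid + 1, out = 0;
        while (left <= mid && right <= highIndex) {
            // <= keeps equal elements in their original order (stable)
            temp[out++] = (array[left] <= array[right])
                ? array[left++] : array[right++];
        }
        while (left <= mid) temp[out++] = array[left++];
        while (right <= highIndex) temp[out++] = array[right++];
        for (out = 0; out < width; out += 1) {
            array[lowIndex + out] = temp[out];
        }
    } // merge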
Experimental results for Merge Sort: compares + moves ≈ 2.9 n log n.

    n     compares  moves  n log n  n²/2
    100   546       1344   664      5000
    200   1286      3088   1528     20000
    400   2959      6976   3457     80000
    800   6741      15552  7715     320000
    1600  15017     34304  17030    1280000

35 Red-black trees (Guibas and Sedgewick 1978)

• Class 10, 3/2/2021
• Red-black trees balance themselves during online insertion.
• Their representation requires pointers both to children and to the parent.
• Each node is red or black.
• The pseudo-nodes (or null nodes) at bottom are black.
• The root node is black.
• Red nodes have only black children. So no path has two red nodes in a row.
• All paths from the root to a leaf have the same number of black nodes.
• To keep the tree acceptable, we sometimes rotate, which reorganizes the tree locally without changing the symmetric traversal.
• [Figure: rotation. A right rotation turns y(x(a, b), c) into x(a, y(b, c)); a left rotation undoes it.]
• To insert
  • place new node n in the tree and color it red. O(log n).
  • walk up the tree from n, rotating as needed to restore color rules. O(log n).
  • color the root black.
• The cases (in the figures, circled nodes are black, the others red; a star marks where to continue up the tree):
  • case 1: parent and uncle red — recolor g, p, and u; continue up the tree at g.
  • case 2: parent red, uncle black, c inside — rotate c up and p down; continue to case 3.
  • case 3: parent red, uncle black, c outside — recolor, then rotate p up.
• try with values 1..6:
• [Figure: inserting 1 through 6 in order; each step shows a color flip (case 1) or a rotation (case 3). The final tree has 2 at the root, children 1 and 4, 4’s children 3 and 5, and 6 below 5.]
• try with these values: 5, 2, 7, 4 (case 1), 3 (case 2), 1 (case 1)

36 Review of binary trees

• Binary trees have expected O(log n) depth, but they can have O(n) depth.
• insertion
• traversal: preorder, postorder, inorder = symmetric order.
• deletion of node D
  • If D is a leaf, remove it.
  • If D has one child C, move C in place of D.
  • If D has two children, find its successor: S = RL*. Move S in place of D. S has no left child, but if it has a right child C, move C in place of S.

• By example.
• The depth of a balanced ternary tree is log₃ n, which is only 63% of the depth of a balanced binary tree.
• The number of comparisons needed to traverse an internal node during a search is either 1 or 2; average 5/3.
• So the number of comparisons to reach a leaf is (5/3) log₃ n instead of (for a binary tree) log₂ n, a ratio of 1.05, indicating a 5% degradation.
• The situation gets only worse for larger arity. For quaternary trees, the degradation (in comparison to binary trees) is about 12.5%.
• And, of course, an online construction is not balanced.
• Moral: binary is best; higher arity is not helpful.

38 Quad trees (Finkel 1973)

• Extension of sorted binary trees to two dimensions.
• Internal nodes contain a discriminant, which is a two-dimensional (x, y) value.
• Internal nodes have four children, corresponding to the four quadrants from the discriminant.
• Leaf nodes contain a bucket of b values.
• Insertion
  • Dive down the tree, put the new value in its bucket.
  • If the bucket overflows, pick a good discriminant and subdivide.
  • Good discriminant: one that separates the values as evenly as possible. Suggestion: median (x, y) values.
• Offline algorithm to build a balanced tree
  • Put all elements in a single bucket, then recursively subdivide as above.
• Heavily used in 3-d modeling for graphics, often with discriminant chosen as midpoint, not median.
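A sketch of what a quad-tree node might look like, following the description above; every name here is an assumption, and b (the bucket capacity) is left as a constant:

    #include <stdbool.h>

    #define BUCKETSIZE 8 // b: how many values a leaf can hold

    typedef struct { float x, y; } point;

    typedef struct quadNode_s {
        bool isLeaf;
        point discriminant; // meaningful only for internal nodes
        struct quadNode_s *quadrant[4]; // four children (internal nodes)
        int count; // values currently in the bucket (leaf nodes)
        point bucket[BUCKETSIZE]; // the bucket itself (leaf nodes)
    } quadNode;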
39 k-d trees (Bentley and Finkel 1973)

• Extension of sorted binary trees to d dimensions.
• Especially good when d is high.
• Internal nodes contain a dimension number (0 .. d−1) and a discriminant value (real).
• Internal nodes have two children, corresponding to values ≤ and > the discriminant in the given dimension.
• Leaf nodes contain a bucket of b values.
• Offline construction and online insertion are similar to quad trees.
  • To split a bucket of values, pick the dimension number with the largest range across those values.
  • Given the dimension, pick the median of the values in that dimension as the discriminant.
  • That choice of dimension number tends to make the domain of each bucket roughly cubical; that choice of discriminant balances the tree.
• Nearest-neighbor search: Given a d-dimensional probe value p, find the nearest neighbor to p that is in the tree.
• Used for cluster analysis, categorizing (as in optical character recognition).

40 2-3 trees (John Hopcroft, 1970)

• Class 12, 3/9/2021
• By example.
• Like a ternary tree, but with a different rule of insertion
• Always completely balanced
• A node may hold 1, 2, or 3 (temporarily) values.
• A node may have 0 (only leaves), 2, 3, or 4 (temporarily) children.
• A node that has 3 values splits and promotes its middle value to its parent (recursively up the tree).
• If the root splits, it promotes a new root.
• Complexity: O(n log n) for insertion and search, guaranteed.
• Deletion: unpleasant.

41 Stooge Sort

• A terrible method, but fun to analyze.
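The Stooge Sort code itself is not on these pages; this is the standard method, matching the recurrence analyzed below (sort the first two-thirds, the last two-thirds, then the first two-thirds again), using the swap helper from the sorting code earlier:

    void stoogeSort(int array[], int lowIndex, int highIndex) {
        if (array[lowIndex] > array[highIndex]) {
            swap(array, lowIndex, highIndex);
        }
        if (highIndex - lowIndex < 2) return; // 1 or 2 elements: done
        int third = (highIndex - lowIndex + 1) / 3;
        stoogeSort(array, lowIndex, highIndex - third); // first 2/3
        stoogeSort(array, lowIndex + third, highIndex); // last 2/3
        stoogeSort(array, lowIndex, highIndex - third); // first 2/3 again
    } // stoogeSort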
• c_n = 1 + 3c_{2n/3}
• a = 3, b = 3/2, k = 0, so b^k = 1. By the recursion theorem (page 18), since a > b^k, we have complexity Θ(n^{log_b a}) = Θ(n^{log_{3/2} 3}) ≈ Θ(n^{2.71}), so Stooge Sort is worse than quadratic.
• However, the recursion often encounters already-sorted sub-arrays. If we add a check for that situation, Stooge Sort becomes roughly quadratic.

42 B trees (Ed McCreight 1972)

• A generalization of 2-3 trees, invented when McCreight was at Boeing, hence the name.
• Choose a number m (the bucket size) such that m values plus m disk indices fit in a single disk block. For instance, if a block is 4KB, a value takes 4B, and an index takes 4B, then m = 4KB/8B = 512.
• m = 3 ⇒ 2-3 tree.
• Class 13, 3/11/2021
• Each node has 1 .. m−1 values and 0 .. m children. (We have room for m values; the extra can be used for pseudo-data.)
• A completely filled tree with n keys (values), height h:
  • Number of nodes a = 1 + m + m² + ··· + m^h = (m^{h+1} − 1)/(m − 1).
  • Number of keys n = (m − 1)a = m^{h+1} − 1 ⇒ log_m(n + 1) = h + 1 ⇒ h is O(log n).
• A sparsely filled tree with n keys (values), height h:
  • The root has two subtrees; the others have g = ⌈m/2⌉ subtrees, so:
  • Number of nodes a = 1 + 2(1 + g + g² + ··· + g^{h−1}) = 1 + 2(g^h − 1)/(g − 1).
  • The root has 1 key, the others have g − 1 keys, so:
  • Number of keys n = 1 + 2(g^h − 1) = 2g^h − 1 ⇒ h = log_g((n + 1)/2) = O(log n).

43 Deletion from a B tree

• Deletion from an internal node: replace the value with its successor (taken from a leaf), and then proceed to deletion from a leaf.
• Deletion from a leaf: the bad case is that it can cause underflow: the leaf now has fewer than g keys.
• In case of underflow, borrow a value from a neighbor if possible, adjusting the appropriate key in the parent.
• If all neighbors (there are 1 or 2) are already minimal, grab a key from the parent and also merge with a neighbor.
• In general, deletion is quite difficult.

44 Hashing

• Very popular data structure for searching.
• Cost of insertion and of search is O(log n), but only because n distinct values must be log n bits long, and we need to look at the entire key. If we consider looking at a key to be O(1), then hashing is expected (but not guaranteed) to be O(1).
• Idea: find the value associated with key k at A[h(k)], where
  • h() maps keys to integers in 0 .. s−1, where s is the size of A[].
  • h() is “fast”. (It generally needs to look at all of k, though.)
• Example
  • k = student in class.
  • h(k) = k’s birthday (a value from 0 .. 365).
• Difficulty: collisions
• Linear probing. Probe p is at index h(k) + p (mod s), for p = 0, 1, ...
  • Terrible behavior when A[] is almost full, because chains coalesce. This problem is called “primary clustering”.
• Additional hash functions. Use a family of hash functions, h₁(), h₂(), ...
  • insertion: keep probing with different functions until an empty slot is found.
  • searching: probe with different functions until you find the key (success) or an empty slot (failure).
  • You need a family of independent hash functions.
  • The method is very expensive when A[] is almost full.
• These methods suffer from clustering.
• Deletion is hard, because removing an element can damage unrelated searches. Deletion by marking is the only reasonable approach.
• Perfect hashing: if you know all n values in advance, you can look for a non-colliding hash function h. Finding such a function is in general quite difficult, but compiler writers do sometimes use perfect hashing to detect keywords in the language (like if and for).
• 2-3 tree. Preorder result: 3 1 1 (2, 3) 5 (4, 5) (6, 9)
• red-black tree. Preorder result: 3b 1 1b 2b 3 5 4b 5 9b 6

47 Midterm exam

Class 15, 3/18/2021

48 Midterm exam follow-up

Class 16, 3/23/2021

50 Hashing: Dealing with collisions: external chaining

• Each element in A is a pointer, initially null, to a bucket, which is a linked list of nodes that hash to that element; each node contains k and any other associated data.
• insert: place k at the front of A[h(k)].
• search: look through the list at A[h(k)].
  • optimization: When you find a node, promote it to the start of its list.
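A sketch of external chaining in C, following the bullets above; the table size and the placeholder hash function are assumptions:

    #include <stdlib.h>
    #define TABLESIZE 1024 // s, the size of A[]

    typedef struct chainNode_s {
        int key; // k; any other associated data would go here too
        struct chainNode_s *next;
    } chainNode;

    chainNode *A[TABLESIZE]; // each element starts as NULL

    int h(int k) { return ((unsigned) k) % TABLESIZE; } // placeholder hash

    void insertChain(int k) {
        chainNode *answer = (chainNode *) malloc(sizeof(chainNode));
        answer->key = k;
        answer->next = A[h(k)]; // place k at the front of its list
        A[h(k)] = answer;
    } // insertChain

    chainNode *searchChain(int k) {
        chainNode *current;
        for (current = A[h(k)]; current != NULL; current = current->next) {
            if (current->key == k) return current;
        }
        return NULL; // not present
    } // searchChain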
52 Hashing: How big should the array be?

• Some open-addressing methods prefer that s = |A| be prime.
• Computing h() is faster if s = 2^j for some j.
• Open addressing gets very bad if s < 2n, depending on method. Linear probing is the worst; I would make sure s ≥ 3n.
• External chaining works fine even when s ≈ n, but it gets steadily worse.

53 Hashing: What should we do if we discover that s is too small?

• We can rebuild with a bigger s, rehashing every element. But that operation causes a temporary “outage”, so it is not acceptable for online work.
• Extendible hashing
  • Start with one bucket. If it gets too full (list longer than 10, say), split it on the last bit of h(k) into two buckets.
  • Whenever a bucket based on the last j bits is too full, split it based on bit j + 1 from the end.
  • To find the bucket
    • compute v = h(k).
    • follow a tree that discriminates on the last bits of v. This tree is called a trie.
    • it takes at most log v steps to find the right bucket.
  • Searching within the bucket now is guaranteed to take constant time (ignoring the log n cost of comparing keys).

54 Hash tables (associative arrays) in scripting languages

• Class 18, 3/30/2021
• Like an array, but the indices are strings.
• Resizing the array is automatic, although one might specify the expected size in advance to avoid resizing during early growth.
• Perl has a built-in datatype called a hash.

    my %foo;
    $foo{"this"} = "that";

• Python has dictionaries.

    Foo = dict()
    Foo['this'] = 'that'

• JavaScript arrays are all associative.

    const foo = [];
    foo['this'] = 'that';
    foo.this = 'that';

55 Cryptographic hashes: digests

• purpose: uniquely identify text of any length.
• these hashes are not used for searching.
• goals
  • fast computation
  • uninvertible: given h(k), it should be infeasible to compute k.
  • it should be infeasible to find collisions k₁ and k₂ such that h(k₁) = h(k₂).
• examples
  • MD5: 128 bits. Practical attack in 2008.
  • SHA-1: 160 bits, but (2005) one can find collisions in 2⁶⁹ hash operations (brute force would use 2⁸⁰).
  • SHA-2: usual variant is SHA256; also SHA-512.
• uses
56 Graphs

• Our standard graph:
  [Figure: seven vertices (1–7) joined by edges e1–e7.]
• Nomenclature
  • vertices: V is the name of the set, v is the size of the set. In our example, V = {1, 2, 3, 4, 5, 6, 7}.
  • edges: E is the name of the set, e is the size of the set. In our example, E = {e1, e2, e3, e4, e5, e6, e7}.
  • directed graph: edges have direction (represented by arrows).
  • undirected graph: edges have no direction.
  • multigraph: more than one edge between two vertices. We generally do not deal with multigraphs, and the word graph generally disallows them.
  • weighted graph: each edge has a numeric label called its weight.
• Graphs represent situations
  • streets in a city. We might be interested in computing paths.
  • airline routes, where the weight is the price of a flight. We might be interested in minimal-cost cycles.
    • Hamiltonian cycle: no duplicated vertices (cities).
    • Eulerian cycle: no duplicated edges (flights).
  • Islands and bridges, as in the bridges of Königsberg, later called Kaliningrad (Euler 1707–1783). This is a multigraph, not strictly a graph. Can you find an Eulerian cycle?
  • Family trees. These graphs are bipartite: family nodes and person nodes. We might want to find the shortest path between two people.
  • Cities and roadways, with weights indicating distance. We might want a minimal-cost spanning tree.

57 Data structures representing a graph

• Adjacency matrix
  • an array n × n of Boolean.
  • A[i, j] = true ⇒ there is an edge from vertex i to vertex j.
  • [Table: the 7 × 7 adjacency matrix of the standard graph, with an x wherever an edge joins the row’s vertex to the column’s vertex.]
  • The array is symmetric if the graph is undirected
    • in this case, we can store only one half of it, typically in a 1-dimensional array
    • A[i(i−1)/2 + j] holds information about edge (i, j).
  • Instead of Boolean, we can use integer values to store edge weights.
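A sketch of the one-half storage trick for an undirected graph, using the A[i(i−1)/2 + j] formula above; it assumes vertices numbered 0 .. v−1 and the convention i > j:

    #include <stdbool.h>
    #define MAXVERTICES 100

    // one bit of information per unordered pair {i, j} with i > j
    bool halfMatrix[MAXVERTICES * (MAXVERTICES - 1) / 2];

    bool hasEdge(int i, int j) {
        if (i == j) return false; // no self-loops stored
        if (i < j) { int t = i; i = j; j = t; } // normalize so i > j
        return halfMatrix[i*(i-1)/2 + j]; // A[i(i-1)/2 + j]
    } // hasEdge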
    void weightedBFS(vertex start, vertex goal) {
        // assume visited[*] == () at start
        workHeap = makeHeap(); // top-light
        insertInHeap(workHeap, (0, start, start));
            // distance, vertex, from where
        while (! emptyHeap(workHeap)) {
            (distance, place, from) = deleteFromHeap(workHeap);
            if (visited[place] != ()) continue; // already seen
            visited[place] = (from, distance);
            if (place == goal) return; // could print path
            foreach (neighbor, weight) in (successors(place)) {
                insertInHeap(workHeap, (distance+weight, neighbor, place));
            } // foreach neighbor
        } // while queue not empty
    } // weightedBFS

63 Dijkstra’s algorithm: Finding all shortest paths from a given vertex in a weighted graph

The weights must be positive. Weiss §9.3.2

• Rule: among all vertices that can extend a shortest path already found, choose the one that results in a shortest path. If there is a tie ending at the same vertex, choose either. If there is a tie going to different vertices, choose both.
• This is an example of a greedy algorithm: at each step, improve the solution in the way that looks best at the moment.
• Starting position: one path, length 0, from start vertex j to j.
• [Worked example: a weighted graph explored starting at vertex 5. The table of paths and lengths grows (5 → 0, 56 → 40, 51 → 60, 563 → 100, 5634 → 120), twice discovering a better way to add a vertex (vertices 3 and 4).]
• Another example: [a weighted graph explored starting at vertex 1; the confirmed paths are 1, 12, 14, 123, 1235 with lengths 0, 3, 3, 4, 5.]
64 Topological sort

• Sample application: course prerequisites place some pairs of courses in order, leading to a directed, acyclic graph (DAG). We want to find a total order; there may be many acceptable answers.
• Weiss §9.2
• [Figure: a DAG of ten courses (115, 215, 216, 275, 280, 315, 335, 405, 470, 471) with prerequisite edges; two possible resulting total orders are listed.]
• method: DFS looking for sinks (vertices with fanout 0), which are then placed at the front of the growing result.

    list answerList; // global

    void topologicalSort() { // computes answerList
        foreach j (vertices) visited[j] = false;
        answerList = makeEmptyList();
        foreach j (vertices)
            if (! visited[j]) tsRecurse(j);
    } // topologicalSort

    void tsRecurse(vertex here) { // adds to answerList
        visited[here] = true;
        foreach next (successors(here))
            if (! visited[next]) tsRecurse(next);
        insertAtFront(answerList, here);
    } // tsRecurse

65 Spanning trees

• Class 21, 4/8/2021
• Weiss §9.5
• Spanning tree: Given a connected undirected graph, a cycle-free connected subgraph containing all the original vertices.
• [Figure: a weighted graph on vertices 1..6 and one of its spanning trees.]
• Minimum-weight spanning tree: Given a connected undirected weighted graph, a spanning tree with least total weight.
• Example: minimum-cost set of roads (edges) connecting a set of cities (vertices).

66 Prim’s algorithm

    Start with any vertex as the current tree.
    do v − 1 times
        connect the current tree to the closest external vertex

• This is a greedy algorithm: at each step, improve the solution in the way that looks best at the moment.
• Example: start with 5. We add: (5,6), (5,1), (1,3), (3,4), (1,2)
• Implementation
  • Keep a top-light heap of all external vertices based on their distance to the current tree (and store to which tree vertex they connect at that distance).
  • Initially, all distances are ∞ except for the neighbors of the starting vertex.
  • Repeatedly take the closest vertex f and add its edge to the current tree.
  • For all external neighbors b of f, perhaps f is a better way to connect b to the tree; if so, update b’s information in the heap. (Remove b and reinsert it with the better distance.)
• Complexity: O(v · log v + e), because we add each vertex once, removing it from a heap that can have v elements; and we need to consider each edge twice (once from each end).

67 Kruskal’s algorithm

    Start with all vertices, no edges.
    do v − 1 times
        add the lowest-cost missing edge that does not form a cycle

• This is a greedy algorithm: at each step, improve the solution in the way that looks best at the moment.
• We can stop when we have added v − 1 edges; all the rest will certainly introduce cycles.
• Data representation: list of edges, sorted by weight.
• Complexity: assuming that keeping track of the component of each vertex is O(log* v), the complexity is O(e log e + v log* v), because we must sort the edges and then add v − 1 edges.
• general idea
  • Every vertex has at most one parent, initially nil.
  • Find the representative b′ of b by following parent links until the end.
  • Find the representative c′ of c.
  • If b′ = c′, they are already in the same component. Done.
  • Point either b′ to c′ or c′ to b′ by introducing a parent link between them.
  • We want trees to be as shallow as possible. So record the height of each tree in its root. Point the shallower one at the deeper one.
  • We can compress paths while searching for the representative. In this case, the height recorded in the root is just an estimate.
• We use this data structure in Kruskal’s algorithm to avoid cycles:

    typedef struct vertex_s {
        int name; // need not be int
        struct vertex_s *representative; // NULL => me
        int depth; // only if I represent my group; 0 initially
    } vertex_t;
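A sketch of the find and union operations over vertex_t, with path compression as described above (the function names are assumptions):

    vertex_t *findRepresentative(vertex_t *v) {
        if (v->representative == NULL) return v; // v represents its group
        vertex_t *root = findRepresentative(v->representative);
        v->representative = root; // compress the path
        return root;
    } // findRepresentative

    // Returns false if b and c were already in the same component,
    // which is how Kruskal's algorithm detects a cycle.
    bool unionComponents(vertex_t *b, vertex_t *c) {
        vertex_t *bRoot = findRepresentative(b);
        vertex_t *cRoot = findRepresentative(c);
        if (bRoot == cRoot) return false; // same component already
        if (bRoot->depth < cRoot->depth) {
            bRoot->representative = cRoot; // point shallower at deeper
        } else if (cRoot->depth < bRoot->depth) {
            cRoot->representative = bRoot;
        } else {
            cRoot->representative = bRoot;
            bRoot->depth += 1; // just an estimate once paths are compressed
        }
        return true;
    } // unionComponents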
70 Euclidean algorithm: greatest common divisor (GCD)

• Examples: gcd(12,60)=12, gcd(15,66)=3, gcd(15,67)=1.

    int gcd(int a, int b) {
        while (b != 0) {
            (a, b) = (b, a % b); // simultaneous assignment, as pseudo-code
        }
        return(a);
    } // gcd

• Example:  a: 12 60 12
            b: 60 12  0
• Example:  a: 15 66 15 6 3
            b: 66 15  6 3 0
• Example:  a: 15 67 15 7 1
            b: 67 15  7 1 0

[Modular exponentiation]

• As we read the binary representation of e from left to right, we could start with the leading 0’s without any harm.
• Example (run with the bc calculator program): 243^745 mod 452. 745₁₀ = 1011101001₂.

    a = 243
    m = 452
    r = 1
    r = r^2*a % m
    r = r^2 % m
    r = r^2*a % m
    r = r^2*a % m
    r = r^2*a % m
    r = r^2 % m
    r = r^2*a % m
    r = r^2 % m
    r = r^2 % m
    r = r^2*a % m
    r
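The bc transcript above does square-and-multiply by hand; a C sketch of the same left-to-right method (the name modPower is an assumption):

    // computes (base^exponent) mod modulus by scanning the exponent's
    // bits from the left: square each step, multiply when the bit is 1
    long modPower(long base, long exponent, long modulus) {
        long result = 1;
        long bit = 1;
        while (bit <= exponent / 2) bit <<= 1; // highest 1 bit of exponent
        for (; bit > 0; bit >>= 1) {
            result = (result * result) % modulus; // r = r^2 % m
            if (exponent & bit) {
                result = (result * base) % modulus; // r = r*a % m
            }
        }
        return result;
    } // modPower

For instance, modPower(243, 745, 452) performs exactly the steps of the transcript above.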
costs n². All the additions and shifts (multiplying by powers of 10) cost just O(n), which we ignore.
• We can use the Recursion Theorem (page 17): c_n = n + 4c_{n/2}. Then a = 4, b = 2, k = 1, so a > b^k, so c_n = Θ(n^{log_b a}) = Θ(n^{log₂ 4}) = Θ(n²).
• But we can introduce u = ac, v = bd, and w = (a + b)(c + d) at a cost of (3/4)n².
• Now xy = u·10^n + (w − u − v)·10^{n/2} + v, which costs no further multiplications.
• Example
  • x = 3962, y = 4481
  • a = 39, b = 62, c = 44, d = 81
  • u = ac = 1716, v = bd = 5022, w = (a + b)(c + d) = 12625
  • w − u − v = 5887
  • xy = 17753722.
• In bc:

    x = 3962
    y = 4481
    a = 39
    b = 62
    c = 44
    d = 81
    u = a*c
    v = b*d
    w = (a+b)*(c+d)
    x * y
    u*10^4 + (w-u-v)*10^2 + v

• We can apply this construction recursively. c_n = n + 3c_{n/2}. We can again apply the Recursion Theorem (page 17): a = 3, b = 2, k = 1, so a > b^k, so c_n = Θ(n^{log_b a}) = Θ(n^{log₂ 3}) ≈ Θ(n^{1.585}).
• For small n, this improvement is small. But for n = 100, we reduce the cost from 10,000 to about 1480. Running bc -l:

    bigInt bigMult(bigInt x, bigInt y, int n) {
        // n-chunk multiply of x and y
        bigInt a, b, c, d, u, v, w;
        if (n == 1) return(toBigInt(toInt(x)*toInt(y)));
        a = extractPart(x, 0, n/2 - 1); // high part of x
        b = extractPart(x, n/2, n-1); // low part of x
        c = extractPart(y, 0, n/2 - 1); // high part of y
        d = extractPart(y, n/2, n-1); // low part of y
        u = bigMult(a, c, n/2); // recursive
        v = bigMult(b, d, n/2); // recursive
        w = bigMult(bigAdd(a,b), bigAdd(c,d), n/2); // recursive
        return(
            bigAdd(
                bigShift(u, n),
                bigAdd(
                    bigShift(bigSubtract(w, bigAdd(u,v)), n/2),
                    v
                ) // add
            ) // add
        );
    }

73 Strings and pattern matching — Text search problem

• Class 24, 4/20/2021
• The problem: Find a match for pattern p within a text t, where |p| = m and |t| = n.
• Application: t is a long string of bytes (a “message”), and p is a short string of bytes (a “word”).
• We will look at several algorithms; there are others.
• Non-classical version: approximate match, regular expressions, more complicated patterns.
• We will start with fingerprinting, a weak version of the final method, just looking at parity, and assuming the strings are composed of 0 and 1 characters.

74 Text search — brute force algorithm

• Return the smallest index j such that t[j .. j+m−1] = p, or −1 if there is no match.

    int bruteSearch(char *t, char *p) {
        // returns index in t where p is found, or -1
        const int n = strlen(t);
        const int m = strlen(p);
        int tIndex = 0;
        p[m] = 0xFF; // impossible character; pseudo-data
        while (tIndex+m <= n) { // there is still room to find p
            int pIndex = 0;
            while (t[tIndex+pIndex] == p[pIndex]) // enlarge match
                pIndex += 1;
            if (pIndex == m) return(tIndex); // hit pseudo-data
            tIndex += 1;
        } // there is still room to find p
        return(-1); // failure
    } // bruteSearch

• The parity of a string of 0 and 1 characters is 0 if the number of 1 characters is even; otherwise the parity is 1.
• Formula: parity = Σⱼ p[j] (mod 2).
• We can compute the parities of windows of m (= 6) bits in t. For example,

    j        0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
    t        0 0 1 0 1 1 0 1 0 1 0  0  1  0  1  0  0  1  1
    tParity  1 1 0 1 0 1 0 1 0 1 0  0  1  1

• Say that p = 010111, which has pParity = 0. We only need to consider matches starting at positions 2, 4, 6, 8, 10, and 11.
• We have saved half the work.
• We can calculate tParity quickly as we move p by looking at only 2, not p, characters of t:
  • Initially, tParity₀ = Σ_{0≤j<m} t[j] (mod 2).
  • Then, tParity_{j+1} = tParity_j + t[j] + t[j + m] (mod 2).

    bit computeParity(bit *string, int length) {
        bit answer = 0;
        for (int index = 0; index < length; index += 1) {
            answer += string[index];
        }
        return (answer & 01);
    } // computeParity

    int fingerprintSearch(bit *t, bit *p) {
        const int n = strlen(t);
        const int m = strlen(p);
        const int pParity = computeParity(p, m);
        int tParity = computeParity(t, m); // initial substring
        int tIndex = 0;
        while (tIndex+m <= n) { // there is still room to find p
            if (tParity == pParity) { // parity check ok
                int pIndex = 0;
                while (t[tIndex+pIndex] == p[pIndex]) { // enlarge match
                    pIndex += 1;
                    if (pIndex >= m) return(tIndex);
                } // enlarge match
            } // parity check ok
            tParity = (tParity + t[tIndex] + t[tIndex+m]) & 01;
            tIndex += 1;
        } // there is still room to find p
        return(-1); // failure
    } // fingerprintSearch

• Instead of bits, we can deal with character arrays.
  • We generalize parity to the exclusive OR of characters, which are just 8-bit quantities.
  • The C operator for exclusive OR is ^.
  • The update rule for tParity is
    tParity = tParity ^ t[tIndex] ^ t[tIndex+m];
• We now have reduced the work to 1/128 (for 7-bit ASCII), not 1/2, for the random case, because only that small fraction of starting positions are worth pursuing.

• The full algorithm extends fingerprinting.
  • Instead of reducing the work to 1/2 or 1/128, we want to reduce it to 1/q for some large q.
  • Use this hash function for the m bytes t[j] .. t[j + m − 1]: Σ_{0≤i<m} 2^{m−1−i} t[j+i] (mod q). Experience suggests that q should be a prime > m.
  • We can still update tParity quickly as we move p by looking at only 2, not p, characters of t: tParity_{j+1} = (t[j + m] + 2(tParity_j − 2^{m−1} t[j])) (mod q).
  • We can use shifting to compute tParity without multiplication: tParity_{j+1} = (t[j+m] + ((tParity_j − (t[j] << (m−1))) << 1)) (mod q). We still need to compute mod q, however.
• Class 25, 4/22/2021
• Monte-Carlo substring search
  • Choose q, a prime close to but not exceeding mn². For instance, if m = 10 and n = 1000, choose a prime q near 10⁷, such as 9,999,991.
  • The probability 1/q that we will make a mistake is very low, so just omit the inner loop. We will sometimes have a false positive, with probability, it turns out, less than 2.53/n.
  • I don’t think we save enough computation to warrant using Monte Carlo search. If false positives are very rare, it doesn’t hurt to employ even a very expensive algorithm to remove them. Checking anyway is called the “Las-Vegas version”.
• The idea is good, but in practice Rabin-Karp takes about 7n work.

76 Text search — Knuth–Morris–Pratt

• Donald Knuth, James Morris, Vaughan Pratt, 1970–1977.
• Consider t = Tweedledee and Tweedledum, p = Tweedledum.
• After running the inner loop of brute-force search to the u in p, we have learned much about t, enough to realize that none of the letters up to that point in t (except the first) are T. So the next place to start a match in t is not position 1, but position 8.
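The pages covering the KMP table are not included in this excerpt; as a sketch, here is the standard prefix (failure) table that drives the shift just described (the names are assumptions):

    // fail[i] = length of the longest proper prefix of p[0..i] that is
    // also a suffix of p[0..i]; after a mismatch, it says where to resume.
    void computeFail(const char *p, int m, int *fail) {
        fail[0] = 0;
        int k = 0; // length of the currently matched prefix
        for (int i = 1; i < m; i += 1) {
            while (k > 0 && p[k] != p[i]) k = fail[k-1];
            if (p[k] == p[i]) k += 1;
            fail[i] = k;
        }
    } // computeFail

For p = Tweedledum the table is all zeros (no letter after the first is T), which is why the search can jump straight to position 8 of t.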
• Charge 1 for each replacement (R), deletion (D), insertion (I).
• Example: ghost →(D) host →(I) houst →(R) house
• The edit distance (s, d) is the smallest number of operations to transform s to d.
• We can build an edit-distance table d by this rule:
  d_{i,j} = min(d_{i−1,j} + 1, d_{i,j−1} + 1, d_{i−1,j−1} + (if s[i] = d[j] then 0 else 1)).
• Example: peseta → presto (should get distance 3).

             -1  0  1  2  3  4  5
                 p  e  s  e  t  a
    -1        0  1  2  3  4  5  6
     0  p     1  0  1  2  3  4  5
     1  r     2  1  1  2  3  4  5
     2  e     3  2  1  2  2  3  4
     3  s     4  3  2  1  2  3  4
     4  t     5  4  3  2  2  2  3
     5  o     6  5  4  3  3  3  3

• We can trace back from the last cell to see exactly how to navigate to the start cell: pick any smallest neighbor to the left/above.
  • ↓: delete a character from the source (left string)
  • →: insert a character from the destination (top string)
  • ↘: keep the same character (if the number is the same) or replace a character in the source (left string) with one from the destination (top string).

• Greedy
• Dynamic programming
• Search

81 Divide and conquer algorithms

• steps
  • if the problem size n is trivial, do it.
  • divide the problem into a easier problems of size n/b.
  • do the a easier problems
  • combine the answers.
• We can usually compute the complexity by the Recursion Theorem (page 17).
• cost: n^k for splitting and recombining, so C_n = n^k + aC_{n/b}.
• Select the jth smallest element of an array. a = 1, b = 2, k = 1 ⇒ O(n).
• Quicksort. a = 2, b = 2, k = 1 ⇒ O(n log n).
• Binary search, search in a binary tree. a = 1, b = 2, k = 0 ⇒ O(log n).
• Multiplication (Karatsuba). a = 3, b = 2, k = 1 ⇒ O(n^{log₂ 3}).
• Tile an n × n board that is missing a single cell with triminoes: a = 4, b = 4, k = 0 ⇒ O(n).
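A sketch in C of the edit-distance table described above; row and column 0 play the role of the −1 entries in the example table (the names are assumptions):

    #include <string.h>

    static int min3(int a, int b, int c) {
        int m = (a < b) ? a : b;
        return (m < c) ? m : c;
    }

    int editDistance(const char *s, const char *d) {
        int sLen = strlen(s), dLen = strlen(d);
        int table[sLen+1][dLen+1];
        for (int i = 0; i <= sLen; i += 1) table[i][0] = i; // all deletions
        for (int j = 0; j <= dLen; j += 1) table[0][j] = j; // all insertions
        for (int i = 1; i <= sLen; i += 1) {
            for (int j = 1; j <= dLen; j += 1) {
                table[i][j] = min3(
                    table[i-1][j] + 1, // delete s[i-1]
                    table[i][j-1] + 1, // insert d[j-1]
                    table[i-1][j-1] + ((s[i-1] == d[j-1]) ? 0 : 1)); // keep/replace
            }
        }
        return table[sLen][dLen];
    } // editDistance

editDistance("peseta", "presto") returns 3, matching the worked example.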