
CS315 Class Notes

Raphael Finkel
May 4, 2021

1 Intro
Class 1, 1/26/2021

• Handout 1 — My names
• TA:
• Plagiarism — read aloud
• Assignments on web. Use C, C++, or Java.
• E-mail list:
• accounts in MultiLab
• text — we will skip around

2 Basic building blocks: Linked lists (Chapter 3) and trees (Chapter 4)
Linked lists and trees are examples of data structures:

• way to represent information


• so it can be manipulated
• packaged with routines that do the manipulations

Leads to an Abstract Data Type (ADT): has an API (specification) and hides its internals.

CS315 Spring 2021 2

3 Tools
Use
Specification

Implementation

4 Singly-linked list
• used as a part of several ADTs.
• Can be considered an ADT itself.
• Collection of nodes, each with optional arbitrary data and a pointer
to the next element on the list.

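[Figure: the handle points to a chain of nodes holding a, c, x, f. Each node has a data field and a pointer field; the last node's pointer is null.]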

operation                            cost
create empty list                    O(1)
insert new node at front of list     O(1)
delete first node, returning data    O(1)
count length                         O(n)
search by data                       O(n)
sort                                 O(n log n) to O(n^2)

5 Sample code (in C)


#include <stdlib.h> // malloc; also defines NULL

typedef struct node_s {
    int data;
    struct node_s *next;
} node;

// Allocate a node holding data, linked to next.
node *makeNode(int data, node *next) {
    node *answer = (node *) malloc(sizeof(node));
    if (answer == NULL) abort(); // out of memory
    answer->data = data;
    answer->next = next;
    return(answer);
} // makeNode

// handle is a header node; the first real node is handle->next.
node *insertAtFront(node *handle, int data) {
    node *answer = makeNode(data, handle->next);
    handle->next = answer;
    return(answer);
} // insertAtFront

node *searchDataIterative(node *handle, int data) {
    // iterative method
    node *current = handle->next;
    while (current != NULL) {
        if (current->data == data) break;
        current = current->next;
    }
    return current; // NULL if not found
} // searchDataIterative

node *searchDataRecursive(node *handle, int data) {
    // recursive method; each node serves as the handle for the rest
    node *current = handle->next;
    if (current == NULL) return NULL;
    else if (current->data == data) return current;
    else return searchDataRecursive(current, data);
} // searchDataRecursive
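The operations table in Section 4 also lists deleting the first node and counting the length, which the sample code does not show. The following is a minimal sketch of those two operations under the same convention (the handle is a header node whose next field points at the first real node). The names deleteFirst and countLength are hypothetical, not from the notes, and the list-building routines are repeated only so that the block compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node_s {
    int data;
    struct node_s *next;
} node;

/* Create an empty list: a header node with no successors. */
node *makeEmptyList(void) {
    node *handle = malloc(sizeof(node));
    assert(handle != NULL);
    handle->data = 0;
    handle->next = NULL;
    return handle;
}

/* Insert at the front, using the same convention as the sample code. */
node *insertAtFront(node *handle, int data) {
    node *answer = malloc(sizeof(node));
    assert(answer != NULL);
    answer->data = data;
    answer->next = handle->next;
    handle->next = answer;
    return answer;
}

/* Delete the first real node and return its data: O(1).
   The caller must ensure the list is not empty. */
int deleteFirst(node *handle) {
    node *victim = handle->next;
    assert(victim != NULL);
    int data = victim->data;
    handle->next = victim->next;
    free(victim);
    return data;
}

/* Count the real nodes by walking the whole list: O(n). */
int countLength(const node *handle) {
    int count = 0;
    for (const node *cur = handle->next; cur != NULL; cur = cur->next) {
        count += 1;
    }
    return count;
}
```

Because every insertion and deletion happens at the front, the most recently inserted value is the first one deleted.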

6 Improving the efficiency of some operations


• To make count() fast: maintain the count in a separate variable. If we
need the count more often than we insert and delete, it is worthwhile.
• To make insert at rear fast: maintain two handles, one to the front,
the other to the rear of the list.
• Combine these new items in a header node:
    typedef struct {
        node *front;
        node *rear;
        int count;
    } nodeHeader;

• Class 2, 1/28/2021
• To make search faster: remove the special case that we reach the end
of the list by placing a pseudo-data node at the end. Keep track of
the pseudo-data node in the header.
    typedef struct {
        node *front;
        node *rear;
        node *pseudo;
        int count;
    } nodeHeader;

    node *searchDataIterative(nodeHeader *header, int data) {
        // iterative method
        header->pseudo->data = data; // guarantees the loop stops
        node *current = header->front;
        while (current->data != data) {
            current = current->next;
        }
        return (current == header->pseudo ? NULL : current);
    } // searchDataIterative
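To see the pseudo-data idea run end to end, here is a small self-contained sketch. It is one possible arrangement, not necessarily the one intended in class: the header keeps only front, pseudo, and count (no rear pointer), an empty list consists of the pseudo-data node alone, and the names makeEmptyList, insertAtFront, and searchData are chosen here for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node_s {
    int data;
    struct node_s *next;
} node;

typedef struct {
    node *front;   /* first node; equals pseudo when the list is empty */
    node *pseudo;  /* pseudo-data node, always last in the list */
    int count;     /* number of real nodes */
} nodeHeader;

nodeHeader *makeEmptyList(void) {
    nodeHeader *header = malloc(sizeof(nodeHeader));
    node *pseudo = malloc(sizeof(node));
    assert(header != NULL && pseudo != NULL);
    pseudo->data = 0;
    pseudo->next = NULL;
    header->front = header->pseudo = pseudo;
    header->count = 0;
    return header;
}

void insertAtFront(nodeHeader *header, int data) {
    node *newNode = malloc(sizeof(node));
    assert(newNode != NULL);
    newNode->data = data;
    newNode->next = header->front;
    header->front = newNode;
    header->count += 1;
}

/* The loop tests only the data, never the end of the list, because
   the pseudo-data node is guaranteed to match. */
node *searchData(nodeHeader *header, int data) {
    header->pseudo->data = data;
    node *current = header->front;
    while (current->data != data) {
        current = current->next;
    }
    return (current == header->pseudo) ? NULL : current;
}
```

The saving is one comparison per node visited: the loop no longer checks for the end of the list.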

• Exercise: If we want both pseudo-data and a rear pointer, how does an empty list look?
• Exercise: If we want pseudo-data, how does searchDataRecursive() change?
• Exercise: Is it easy to add a new node after a given node?
• Exercise: Is it easy to add a new node before a given node?

7 Aside: Unix pipes


• Unix programs automatically have three “files” open: standard input,
which is by default the keyboard, standard output, which is by
default the screen, and standard error, which is by default the screen.
• In C and C++, they are defined in stdio.h by the names stdin,
stdout, and stderr.
• The command interpreter (in Unix, it’s called the “shell”) lets you
invoke programs redirecting any or all of these three. For instance,
ls | wc redirects stdout of the ls program to stdin of the wc
program.
• If you run your trains program without redirection, you can type
in arbitrary numbers.
• If you run randGen.pl without redirection, it generates an unbounded
list of pseudo-random numbers to stdout.
• If you run randGen.pl | trains, the list of numbers from randGen.pl
is redirected as input to trains.

8 Stacks, queues, dequeues: built out of either linked lists or arrays
• We’ll see each of these.

9 Stack of integer
• Abstract definition: either empty or the result of pushing an integer
onto the stack.
• operations

• stack makeEmptyStack()
• boolean isEmptyStack(stack S)
• int popStack(stack *S) // modifies S

• void pushStack(stack *S, int I) // modifies S

10 Implementation 1 of Stack: Linked list


• makeEmptyStack implemented by makeEmptyList()
• isEmptyStack implemented by isEmptyList()
• pushStack inserts at the front of the list
• popStack deletes from the front of the list

11 Implementation 2 of Stack: Array


• Class 3, 2/2/2021

• Warning: it’s easy to make off-by-one errors.


#define MAXSTACKSIZE 10
#include <stdio.h>
#include <stdlib.h>

// error() is not defined in these notes. This version (report and
// exit) is an assumption; it returns int so it can be used in a return.
int error(const char *message) {
    fprintf(stderr, "%s\n", message);
    exit(EXIT_FAILURE);
} // error

typedef struct {
    int contents[MAXSTACKSIZE];
    int count; // index of first free space in contents
} stack;

stack *makeEmptyStack() {
    stack *answer = (stack *) malloc(sizeof(stack));
    answer->count = 0;
    return answer;
} // makeEmptyStack

void pushOntoStack(stack *theStack, int data) {
    if (theStack->count == MAXSTACKSIZE) {
        (void) error("stack overflow");
    } else {
        theStack->contents[theStack->count] = data;
        theStack->count += 1;
    }
} // pushOntoStack

int popFromStack(stack *theStack) {
    if (theStack->count == 0) {
        return error("stack underflow");
    } else {
        theStack->count -= 1;
        return theStack->contents[theStack->count];
    }
} // popFromStack

• The array implementation limits the size. Does the linked-list implementation also limit the size?
• The array implementation needs one cell per (potential) element, and one for the count. How much space does the linked-list implementation need?
• We can position two opposite-sense stacks in one array so long as their combined size never exceeds MAXSTACKSIZE.
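The two-stack arrangement can be sketched as follows. This is only an illustration: the type twoStacks and the routines pushLow, pushHigh, popLow, and popHigh are names invented here, and overflow is reported by returning false rather than by calling an error routine.

```c
#include <assert.h>
#include <stdbool.h>

#define MAXSTACKSIZE 10

typedef struct {
    int contents[MAXSTACKSIZE];
    int lowCount;   /* size of the stack growing up from index 0 */
    int highCount;  /* size of the stack growing down from the last index */
} twoStacks;

void initTwoStacks(twoStacks *s) {
    s->lowCount = s->highCount = 0;
}

/* Either push fails only when the two stacks together fill the array. */
bool pushLow(twoStacks *s, int data) {
    if (s->lowCount + s->highCount == MAXSTACKSIZE) return false;
    s->contents[s->lowCount] = data;
    s->lowCount += 1;
    return true;
}

bool pushHigh(twoStacks *s, int data) {
    if (s->lowCount + s->highCount == MAXSTACKSIZE) return false;
    s->highCount += 1;
    s->contents[MAXSTACKSIZE - s->highCount] = data;
    return true;
}

int popLow(twoStacks *s) {
    assert(s->lowCount > 0);
    s->lowCount -= 1;
    return s->contents[s->lowCount];
}

int popHigh(twoStacks *s) {
    assert(s->highCount > 0);
    int data = s->contents[MAXSTACKSIZE - s->highCount];
    s->highCount -= 1;
    return data;
}
```

Neither stack has a fixed share of the array: one may hold 7 elements while the other holds 3, or the reverse.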

12 Queue of integer
• Abstract definition: either empty or the result of inserting an integer
at the rear of a queue or deleting an integer from the front of a queue.
• operations

• queue makeEmptyQueue()
• boolean isEmptyQueue(queue Q)
• void insertInQueue(queue Q, int I) // modifies Q
• int deleteFromQueue(queue Q) // modifies Q

13 Implementation 1 of Queue: Linked list


We use a header to represent the front and the rear, and we put a dummy
node at the front to make the code work equally well for an empty queue.
[Figure: the header holds the front and rear pointers; front points to the dummy node, and rear points to the last node of the list.]

#include <stdbool.h>
#include <stdlib.h>

typedef struct node_s {
    int data;
    struct node_s *next;
} node;

typedef struct {
    node *front; // always points to the dummy node
    node *rear;  // last node; the dummy if the queue is empty
} queue;

// makeNode() is the routine from Section 5; error() is assumed to
// report the message and return an int.

queue *makeEmptyQueue() {
    queue *answer = (queue *) malloc(sizeof(queue));
    answer->front = answer->rear = makeNode(0, NULL);
    return answer;
} // makeEmptyQueue

bool isEmptyQueue(queue *theQueue) {
    return (theQueue->front == theQueue->rear);
} // isEmptyQueue

void insertInQueue(queue *theQueue, int data) {
    node *newNode = makeNode(data, NULL);
    theQueue->rear->next = newNode;
    theQueue->rear = newNode;
} // insertInQueue

int deleteFromQueue(queue *theQueue) {
    if (isEmptyQueue(theQueue)) return error("queue underflow");
    node *oldNode = theQueue->front->next;
    int answer = oldNode->data;
    theQueue->front->next = oldNode->next;
    if (theQueue->front->next == NULL) { // queue is now empty
        theQueue->rear = theQueue->front;
    }
    free(oldNode); // the original code leaked this node
    return answer;
} // deleteFromQueue

14 Implementation 2 of Queue: Array


Warning: it’s easy to make off-by-one errors.

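[Figure: the circular array, indexed 0 to MAX, in four states: front before rear; rear before front (the queue wraps around); full, with rear just before front; and empty, with front and rear at the same index.]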

#define MAXQUEUESIZE 30
#include <stdbool.h>

// error() is assumed to report the message and return an int.

typedef struct {
    int contents[MAXQUEUESIZE];
    int front; // index of element at the front
    int rear; // index of first free space after the queue
} queue;

bool isEmptyQueue(queue *theQueue) {
    return (theQueue->front == theQueue->rear);
} // isEmptyQueue

int nextSlot(int index) { // circular
    return (index + 1) % MAXQUEUESIZE;
} // nextSlot

void insertInQueue(queue *theQueue, int data) {
    // one slot always stays free, so that full differs from empty
    if (nextSlot(theQueue->rear) == theQueue->front)
        error("queue overflow");
    else {
        theQueue->contents[theQueue->rear] = data;
        theQueue->rear = nextSlot(theQueue->rear);
    }
} // insertInQueue

int deleteFromQueue(queue *theQueue) {
    if (isEmptyQueue(theQueue)) {
        return error("queue underflow");
    } else {
        int answer = theQueue->contents[theQueue->front];
        theQueue->front = nextSlot(theQueue->front);
        return answer;
    }
} // deleteFromQueue
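One consequence of the full test above is easy to miss: an array of MAXQUEUESIZE cells holds at most MAXQUEUESIZE - 1 elements, because one cell stays free to distinguish full from empty. The sketch below makes that visible with a deliberately tiny array. It is a variant written for illustration: initQueue and isFullQueue are names introduced here, and insertInQueue returns false on overflow instead of calling an error routine.

```c
#include <assert.h>
#include <stdbool.h>

#define MAXQUEUESIZE 4  /* small on purpose, to make wraparound visible */

typedef struct {
    int contents[MAXQUEUESIZE];
    int front;  /* index of element at the front */
    int rear;   /* index of first free space after the queue */
} queue;

void initQueue(queue *q) { q->front = q->rear = 0; }

bool isEmptyQueue(const queue *q) { return q->front == q->rear; }

int nextSlot(int index) { return (index + 1) % MAXQUEUESIZE; }

bool isFullQueue(const queue *q) { return nextSlot(q->rear) == q->front; }

bool insertInQueue(queue *q, int data) {
    if (isFullQueue(q)) return false;
    q->contents[q->rear] = data;
    q->rear = nextSlot(q->rear);
    return true;
}

int deleteFromQueue(queue *q) {
    assert(!isEmptyQueue(q));
    int answer = q->contents[q->front];
    q->front = nextSlot(q->front);
    return answer;
}
```

With four cells, a fourth insertion into a queue of three is refused; after one deletion it succeeds, and rear wraps around to index 0.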

15 Dequeue of integer
• Abstract definition: either empty or the result of inserting an integer
at the front or rear of a dequeue or deleting an integer from the front
or rear of a dequeue.
• operations
• dequeue makeEmptyDequeue()
• boolean isEmptyDequeue(dequeue D)
• void insertFrontDequeue(dequeue D, int I) // modifies D
• void insertRearDequeue(dequeue D, int I) // modifies D
• int deleteFrontDequeue(dequeue D) // modifies D
• int deleteRearDequeue(dequeue D) // modifies D
• Exercise: code the insertFrontDequeue() and deleteRearDequeue()
routines using an array.
• All operations for a singly-linked list implementation are O(1) except
for deleteRearDequeue(), which is O(n).
• The best list structure is a doubly-linked list with a single dummy
node.
[Figure: a doubly-linked list whose nodes have prev, data, and next fields, including the dummy node (dum).]

• Exercise: Code all the routines.
• Exercise: Is it easy to add a new node after a given node?
• Exercise: Is it easy to add a new node before a given node?

16 Searching
• Class 4, 2/4/2021
• Given n data elements (we will use integer data), arrange them in a
data structure D so that these operations are fast:
    • void insert(int data, *D)
    • boolean search(int data, D) (can also return entire data record)
• We don’t care about the speed of deletion (for now).
• Much of this material is in Chapter 4 of the book (trees)
• Representation 1: Linked list
    • insert(i) is O(1): Place new element at the front.
    • search(i) is O(n): We may need to look at whole list; we use
    pseudo-data i to make search as fast as possible
• Representation 2: Sorted linked list
    • insert(i) is O(n): On average, n/2 steps. Use pseudo-data (value
    ∞) at end to make insertion as fast as possible.
    • search(i) is O(n): We may need to look at whole list; on average,
    we look at n/2 elements if the search succeeds; all n elements if
    it fails. Use pseudo-data (value ∞) to make search as fast as
    possible.
• Representation 3: Array
    • insert(i) is O(1): We place new element at the rear.
    • search(i) is O(n): We may need to look at whole list; use
    pseudo-data i at rear.
• Representation 4: Sorted array
    • insert(i) is O(n): We need to search and then shove cells over.
    • search(i) is O(log n): We use binary search.

// warning: it’s easy to make off-by-one errors.
#include <stdbool.h>

bool binarySearch(int target, int *array,
        int lowIndex, int highIndex) {
    // look for target in array[lowIndex..highIndex] (must be non-empty)
    while (lowIndex < highIndex) { // at least 2 elements
        // round down; written this way to avoid overflow in the sum
        int mid = lowIndex + (highIndex - lowIndex) / 2;
        if (array[mid] < target) lowIndex = mid + 1;
        else highIndex = mid;
    } // while at least 2 elements
    return (array[lowIndex] == target);
} // binarySearch

17 Quadratic search: set mid based on discrepancy
Also called interpolation search, extrapolation search, dictionary search.
#include <stdbool.h>

bool quadraticSearch(int target, int *array,
        int lowIndex, int highIndex) {
    // look for target in array[lowIndex..highIndex]
    while (lowIndex < highIndex) { // at least 2 elements
        if (target <= array[lowIndex]) { // cannot lie further right
            highIndex = lowIndex;
            break;
        }
        if (target > array[highIndex]) return false; // out of range
        // here array[lowIndex] < target <= array[highIndex], so the
        // divisor is nonzero and mid stays inside the range
        double percent = ((double) target - array[lowIndex])
            / ((double) array[highIndex] - array[lowIndex]);
        int mid = (int) (percent * (highIndex-lowIndex)) + lowIndex;
        if (mid == highIndex) {
            mid -= 1;
        }
        if (array[mid] < target) {
            lowIndex = mid + 1;
        } else {
            highIndex = mid;
        }
    } // while at least 2 elements
    return(array[lowIndex] == target);
} // quadraticSearch
Experimental results

• It is hard to program correctly.
• For 10^6 ≈ 2^20 elements, binary search always makes 20 probes.
• This result is consistent with O(log n).
• Quadratic search: 20 tests with uniform data. The range of probes
was 3 – 17; the average about 9 probes.
• Analysis shows that if the data are uniformly distributed, quadratic
search should be O(log log n).
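The probe count for binary search is easy to check directly. The sketch below repeats the binary search loop with a counter added; the names binarySearchCounting and makeEvenArray are invented for this illustration. It uses exactly 2^20 elements (rather than 10^6) so that the range halves evenly and every search takes the same number of probes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Same loop as binary search, but also counts loop iterations (probes). */
bool binarySearchCounting(int target, const int *array,
        int lowIndex, int highIndex, int *probes) {
    assert(lowIndex <= highIndex);
    *probes = 0;
    while (lowIndex < highIndex) {
        int mid = lowIndex + (highIndex - lowIndex) / 2;
        *probes += 1;
        if (array[mid] < target) lowIndex = mid + 1;
        else highIndex = mid;
    }
    return array[lowIndex] == target;
}

/* Build the sorted array 0, 2, 4, ... of the given size; caller frees. */
int *makeEvenArray(int size) {
    int *array = malloc((size_t) size * sizeof(int));
    assert(array != NULL);
    for (int i = 0; i < size; i += 1) {
        array[i] = 2 * i;
    }
    return array;
}
```

Searching an array of 1 << 20 elements built this way takes 20 probes whether the target is the first element, the last element, or an odd number that is not present at all.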

18 Analyzing binary search


• Binary search: $c_n = 1 + c_{n/2}$, where $c_n$ is the number of steps to search
for an element in an array of length n.
• We will use the Recursion Theorem: if $c_n = f(n) + a c_{n/b}$, where
$f(n) = \Theta(n^k)$, then

    when           $c_n$
    $a < b^k$      $\Theta(n^k)$
    $a = b^k$      $\Theta(n^k \log n)$
    $a > b^k$      $\Theta(n^{\log_b a})$

• In our case, a = 1, b = 2, k = 0, so $b^k = 1$, so $a = b^k$, so
$c_n = \Theta(n^k \log n) = \Theta(\log n)$.
• Bad news: any comparison-based searching algorithm is Ω(log n),
that is, needs at least on the order of log n steps.
• Notation, slightly more formally defined. All these ignore
multiplicative constants.
    • O(f(n)): no worse than f(n); at most f(n).
    • Ω(f(n)): no better than f(n); at least f(n).
    • Θ(f(n)): no better or worse than f(n); exactly f(n).
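The result for binary search can also be checked without the theorem by unrolling the recurrence. This worked derivation assumes that n is a power of 2 and that $c_1$ is a constant; neither assumption is stated in the notes.

```latex
\begin{align*}
c_n &= 1 + c_{n/2} \\
    &= 2 + c_{n/4} \\
    &= 3 + c_{n/8} \\
    &\;\;\vdots \\
    &= j + c_{n/2^j} \\
    &= \log_2 n + c_1 \qquad \text{(taking } j = \log_2 n\text{)} \\
    &= \Theta(\log n).
\end{align*}
```

This agrees with the case $a = b^k$ of the Recursion Theorem.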

19 Representation 5: Binary tree


• Class 5, 2/9/2021
• Example with elicited values
• Pseudo-data: in the universal “null” node.
• insert(i) and search(i) are both O(log n) if we are lucky or data are
random.

#include <stdlib.h> // malloc; also defines NULL

typedef struct treeNode_s {
    int data;
    struct treeNode_s *left, *right; // C requires the struct keyword here
} treeNode;

treeNode *makeNode(int data) {
    treeNode *answer = (treeNode *) malloc(sizeof(treeNode));
    answer->data = data;
    answer->left = answer->right = NULL;
    return answer;
} // makeNode

treeNode *searchTree(treeNode *tree, int key) {
    if (tree == NULL) return(NULL);
    else if (tree->data == key) return(tree);
    else if (key <= tree->data)
        return(searchTree(tree->left, key));
    else
        return(searchTree(tree->right, key));
} // searchTree

void insertTree(treeNode *tree, int key) {
    // assumes empty tree is a pseudo-node with infinite data,
    // so tree is never NULL on entry
    treeNode *parent = NULL;
    treeNode *newNode = makeNode(key);
    while (tree != NULL) { // dive down tree
        parent = tree;
        tree = (key <= tree->data) ? tree->left : tree->right;
    } // dive down tree
    if (key <= parent->data)
        parent->left = newNode;
    else
        parent->right = newNode;
} // insertTree

• We will deal with balancing trees later.
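The phrase “if we are lucky” can be made concrete by measuring the height of the tree that results from different insertion orders. The sketch below is an illustration only: it uses a recursive insertion that accepts a NULL (empty) tree instead of the pseudo-node convention above, and the names insertKey, height, and heightAfterInserting are invented here.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct treeNode_s {
    int data;
    struct treeNode_s *left, *right;
} treeNode;

/* Recursive insert that also handles the empty tree; returns the
   (possibly new) root of the subtree. */
treeNode *insertKey(treeNode *tree, int key) {
    if (tree == NULL) {
        treeNode *leaf = malloc(sizeof(treeNode));
        assert(leaf != NULL);
        leaf->data = key;
        leaf->left = leaf->right = NULL;
        return leaf;
    }
    if (key <= tree->data) tree->left = insertKey(tree->left, key);
    else tree->right = insertKey(tree->right, key);
    return tree;
}

/* Number of nodes on the longest path from the root down to a leaf. */
int height(const treeNode *tree) {
    if (tree == NULL) return 0;
    int leftHeight = height(tree->left);
    int rightHeight = height(tree->right);
    return 1 + (leftHeight > rightHeight ? leftHeight : rightHeight);
}

/* Insert keys[0..count-1] in the given order; report the height. */
int heightAfterInserting(const int keys[], int count) {
    treeNode *root = NULL;
    for (int i = 0; i < count; i += 1) {
        root = insertKey(root, keys[i]);
    }
    return height(root); /* nodes are not freed; fine for a short demo */
}
```

Inserting 1, 2, ..., 7 in sorted order gives a tree of height 7 (effectively a linked list), while the order 4, 2, 6, 1, 3, 5, 7 gives height 3.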



20 Traversals
• A traversal walks through the tree, visiting every node.
• Symmetric traversal (also called inorder)
    // visit() stands for whatever work is done at each node.
    void symmetric(treeNode *tree) {
        if (tree == NULL) { // do nothing
        } else {
            symmetric(tree->left);
            visit(tree);
            symmetric(tree->right);
        }
    } // symmetric()

• Pre-order traversal
    void preorder(treeNode *tree) {
        if (tree == NULL) { // do nothing
        } else {
            visit(tree);
            preorder(tree->left);
            preorder(tree->right);
        }
    } // preorder()

• Post-order traversal
    void postorder(treeNode *tree) {
        if (tree == NULL) { // do nothing
        } else {
            postorder(tree->left);
            postorder(tree->right);
            visit(tree);
        }
    } // postorder()

• Does pseudo-data make sense?
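To compare the three traversal orders on a concrete tree, the visit step can be replaced by something observable. In this sketch, which is only one way to do it, “visiting” a node appends its data to an output array; the function names ending in Collect are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct treeNode_s {
    int data;
    struct treeNode_s *left, *right;
} treeNode;

/* Visiting a node means appending its data to out[] and advancing *used. */
static void visitInto(const treeNode *tree, int out[], int *used) {
    assert(out != NULL && used != NULL);
    out[*used] = tree->data;
    *used += 1;
}

void symmetricCollect(const treeNode *tree, int out[], int *used) {
    if (tree == NULL) return;
    symmetricCollect(tree->left, out, used);
    visitInto(tree, out, used);
    symmetricCollect(tree->right, out, used);
}

void preorderCollect(const treeNode *tree, int out[], int *used) {
    if (tree == NULL) return;
    visitInto(tree, out, used);
    preorderCollect(tree->left, out, used);
    preorderCollect(tree->right, out, used);
}

void postorderCollect(const treeNode *tree, int out[], int *used) {
    if (tree == NULL) return;
    postorderCollect(tree->left, out, used);
    postorderCollect(tree->right, out, used);
    visitInto(tree, out, used);
}
```

For the search tree with root 2, left child 1, and right child 3, the symmetric traversal yields 1 2 3 (sorted order), pre-order yields 2 1 3, and post-order yields 1 3 2.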

21 Representation 6: Hashing (scatter storage)


• Hashing is often the best method for searching (but not for sorting).

• insert(data) and search(data) are O(log n), but we can generally treat
them as O(1).
• We will discuss hashing later.

22 Finding the jth largest element in a set


• If j = 1, a single pass works in O(n) time:
    largest = -∞; // priming
    foreach (value in set) {
        if (value > largest) largest = value;
    }
    return(largest);

• If j = 2, a single pass still works in O(n) time, but it is about twice as costly:
    largest = nextLargest = -∞; // priming
    foreach (value in set) {
        if (value > largest) {
            nextLargest = largest;
            largest = value;
        } else if (value > nextLargest) {
            nextLargest = value;
        }
    } // foreach value
    return(nextLargest);

• It appears that for arbitrary j we need O(jn) time, because each iteration
needs t tests, where 1 ≤ t ≤ j, followed by modifying j + 1 − t
values, for a total cost of j + 1.
• Class 6, 2/11/2021
• Clever algorithm using an array: QuickSelect (Tony Hoare)

    • Partition the array into “small” and “large” elements with a
    pivot between them (details soon).
    • Recurse in either the small or large subarray, depending where
    the jth element falls. Stop if the jth element is the pivot.
• Cost: n + n/2 + n/4 + ... = 2n = O(n)
• We can also compute the cost using the Recursion Theorem (Section 18):
    • $c_n = n + c_{n/2}$ (if we are lucky)
    • $c_n = n + c_{2n/3}$ (fairly average case)
    • $f(n) = n = \Theta(n^1)$
    • k = 1, a = 1, b = 2 (or b = 3/2)
    • $a < b^k$
    • so $c_n = \Theta(n^k) = \Theta(n)$

23 Partitioning an array
• Nico Lomuto’s method
• Online demonstration.
• The method partitions array[lowIndex .. highIndex] into
three pieces:

• array[lowIndex .. divideIndex -1]


• array[divideIndex]
• array[divideIndex + 1 .. highIndex]

The elements of each piece are in order with respect to adjacent pieces.

// swap(array, i, j) exchanges array[i] and array[j].
int partition(int array[], int lowIndex, int highIndex) {
    // modifies array, returns pivot index.
    int pivotValue = array[lowIndex];
    int divideIndex = lowIndex;
    for (int combIndex = lowIndex+1; combIndex <= highIndex;
            combIndex += 1) {
        // array[lowIndex] is the partitioning (pivot) value.
        // array[lowIndex+1 .. divideIndex] are < pivot
        // array[divideIndex+1 .. combIndex-1] are ≥ pivot
        // array[combIndex .. highIndex] are unseen
        if (array[combIndex] < pivotValue) { // see a small value
            divideIndex += 1;
            swap(array, divideIndex, combIndex);
        }
    } // each combIndex
    // swap pivotValue into its place
    swap(array, divideIndex, lowIndex);
    return(divideIndex);
} // partition

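• Example (d = divideIndex, c = combIndex; the pivot value is 5). The array after each swap:

    5 2 1 7 9 0 3 6 4 8    start (d = 0)
    5 2 1 7 9 0 3 6 4 8    c = 1, 2: 2 and 1 are small; each swaps with itself (d = 2)
    5 2 1 0 9 7 3 6 4 8    c = 5: 0 is small; swapped with 7 (d = 3)
    5 2 1 0 3 7 9 6 4 8    c = 6: 3 is small; swapped with 9 (d = 4)
    5 2 1 0 3 4 9 6 7 8    c = 8: 4 is small; swapped with 7 (d = 5)
    4 2 1 0 3 5 9 6 7 8    final swap of the pivot into position d = 5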

24 Using partitioning to select jth smallest

int selectJthSmallest (int array[], int size, int targetIndex) {
    // rearrange the values in array[0..size-1] so that
    // array[targetIndex] has the value it would have if the array
    // were sorted.
    int lowIndex = 0;
    int highIndex = size-1;
    while (lowIndex < highIndex) {
        int midIndex = partition(array, lowIndex, highIndex);
        if (midIndex == targetIndex) {
            return array[targetIndex];
        } else if (midIndex < targetIndex) { // look to right
            lowIndex = midIndex + 1;
        } else { // look to left
            highIndex = midIndex - 1;
        }
    } // while lowIndex < highIndex
    return array[targetIndex];
} // selectJthSmallest
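The partition and selection routines can be tried together on the ten-element example from Section 23. The block below repeats both routines in compact form so that it compiles on its own; the swap helper is not shown in the notes, so its definition here is an assumption about what it does (exchange two array cells).

```c
#include <assert.h>

/* Assumed helper: exchange array[i] and array[j]. */
static void swap(int array[], int i, int j) {
    int temp = array[i];
    array[i] = array[j];
    array[j] = temp;
}

/* Lomuto-style partition with the pivot taken from array[lowIndex]. */
int partition(int array[], int lowIndex, int highIndex) {
    int pivotValue = array[lowIndex];
    int divideIndex = lowIndex;
    for (int combIndex = lowIndex + 1; combIndex <= highIndex; combIndex += 1) {
        if (array[combIndex] < pivotValue) {
            divideIndex += 1;
            swap(array, divideIndex, combIndex);
        }
    }
    swap(array, divideIndex, lowIndex);
    return divideIndex;
}

/* Return the value that would be at targetIndex if array were sorted. */
int selectJthSmallest(int array[], int size, int targetIndex) {
    assert(0 <= targetIndex && targetIndex < size);
    int lowIndex = 0;
    int highIndex = size - 1;
    while (lowIndex < highIndex) {
        int midIndex = partition(array, lowIndex, highIndex);
        if (midIndex == targetIndex) break;
        else if (midIndex < targetIndex) lowIndex = midIndex + 1;
        else highIndex = midIndex - 1;
    }
    return array[targetIndex];
}
```

Partitioning 5 2 1 7 9 0 3 6 4 8 returns index 5 and leaves 4 2 1 0 3 5 9 6 7 8. Note that targetIndex counts from 0, so index 0 selects the smallest value.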

25 Sorting
• Class 7, 2/16/2021
• We usually are interested in sorting an array in place.
• Comparison-based sorting is Ω(n log n).
• Good methods are O(n log n).
• Bad methods are O(n^2).

26 Sorting out sorting


• https://www.youtube.com/watch?v=HnQMDkUFzh4 Original film.
• https://www.youtube.com/watch?v=kPRA0W1kECg 15 methods in 6 minutes.

27 Insertion sort
• Comb method:
[Figure: the array is divided into a sorted region followed by an unsorted region; the probe is the first unsorted element.]
• n iterations.
• Iteration i may need to shift the probe value i places.
• ⇒ O(n^2).
• Experimental results for Insertion Sort:
compares + moves ≈ n^2/2.

    n      compares   moves    n^2/2
    100    2644       2545     5000
    200    9733       9534     20000
    400    41157      40758    80000

void insertionSort(int array[], int length) {
    // array goes from 1..length.
    // location 0 is available for pseudo-data.
    int combIndex, combValue, sortedIndex;
    for (combIndex = 2; combIndex <= length; combIndex += 1) {
        // array[1 .. combIndex-1] is sorted.
        // Place array[combIndex] in order.
        combValue = array[combIndex];
        sortedIndex = combIndex - 1;
        // pseudo-data: the strict < below stops at an equal value, so
        // combValue itself works (combValue - 1 could overflow)
        array[0] = combValue;
        while (combValue < array[sortedIndex]) {
            array[sortedIndex+1] = array[sortedIndex];
            sortedIndex -= 1;
        }
        array[sortedIndex+1] = combValue;
    } // for combIndex
} // insertionSort

• Stable: multiple copies of the same key stay in order.

28 Selection sort
• Comb method:
[Figure: the array is divided into a sorted region of small values followed by an unsorted region of large values; the smallest unsorted value is moved to the end of the sorted region.]
• n iterations.
• Iteration i may need to search through n − i places.
• ⇒ O(n^2).
• Experimental results for Selection Sort:
compares + moves ≈ n^2/2.

    n      compares   moves   n^2/2
    100    4950       198     5000
    200    19900      398     20000
    400    79800      798     80000
// swap(array, i, j) exchanges array[i] and array[j].
void selectionSort(int array[], int length) {
    // array goes from 0..length-1
    int combIndex, smallestValue, bestIndex, probeIndex;
    for (combIndex = 0; combIndex < length; combIndex += 1) {
        // array[0 .. combIndex-1] has lowest elements, sorted.
        // Find smallest other element to place at combIndex.
        smallestValue = array[combIndex];
        bestIndex = combIndex;
        for (probeIndex = combIndex+1; probeIndex < length;
                probeIndex += 1) {
            if (array[probeIndex] < smallestValue) {
                smallestValue = array[probeIndex];
                bestIndex = probeIndex;
            }
        }
        swap(array, combIndex, bestIndex);
    } // for combIndex
} // selectionSort

• Not stable, because the swap moves an arbitrary value into the unsorted area.

29 Quicksort (C. A. R. Hoare)


• Class 8, 2/18/2021
• Recursive based on partitioning:
[Figure: a random array is partitioned into a small part and a big part; each part is then sorted recursively.]
• about log n depth.
• each depth takes about O(n) work.
• ⇒ O(n log n).
• Can be unlucky: O(n^2).
• To make worst-case behavior unlikely, partition based on median of 3 or 5.
• Don’t Quicksort small regions; use a final insertionSort pass instead.
Experiments show that the optimal break point depends on the
implementation, but somewhere between 10 and 100 is usually good.
• Experimental results for QuickSort:
compares + moves ≈ 2.4 n log n.

    n       compares   moves    n log n   n^2/2
    100     643        824      664       5000
    200     1444       1668     1528      20000
    400     3885       4228     3457      80000
    800     8066       8966     7715      320000
    1600    17583      18958    17030     1280000

• Analysis if lucky: $C_n = n + 2C_{n/2}$, so k = 1, a = 2, b = 2, so $a = b^k$, so
$C_n = \Theta(n^k \log n) = \Theta(n \log n)$.
• Analysis if unlucky: $C_n = n + C_{n/3} + C_{2n/3} < n + 2C_{2n/3}$, so k = 1, a = 2,
b = 3/2, so $a > b^k$, so $C_n = O(n^{\log_b a}) = O(n^{\log_{3/2} 2}) \approx O(n^{1.70951})$,
which is still better than quadratic.

void quickSort(int array[], int lowIndex, int highIndex){
    if (highIndex - lowIndex <= 0) return;
    // could stop if <= 6 and finish by using insertion sort.
    int midIndex = partition(array, lowIndex, highIndex);
    quickSort(array, lowIndex, midIndex-1);
    quickSort(array, midIndex+1, highIndex);
} // quickSort
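Two of the refinements mentioned above (median-of-3 pivot selection and a final insertion-sort pass over small regions) can be combined as in the following sketch. It is one possible arrangement, not the version used for the experiments: the cutoff of 10 is an arbitrary value from the suggested range, and the names CUTOFF, medianOfThreeToFront, quickSortCoarse, insertionSortFrom0, and hybridQuickSort are invented here.

```c
#include <assert.h>

#define CUTOFF 10  /* assumed break point; the notes suggest 10 to 100 */

static void swap(int a[], int i, int j) {
    int temp = a[i];
    a[i] = a[j];
    a[j] = temp;
}

/* Put the median of a[low], a[mid], a[high] into a[low], where the
   partition routine expects to find the pivot. */
static void medianOfThreeToFront(int a[], int low, int high) {
    int mid = low + (high - low) / 2;
    if (a[mid] < a[low]) swap(a, mid, low);
    if (a[high] < a[low]) swap(a, high, low);
    if (a[high] < a[mid]) swap(a, high, mid);
    /* now a[low] <= a[mid] <= a[high]; the median is at mid */
    swap(a, low, mid);
}

static int partition(int a[], int low, int high) {
    int pivotValue = a[low];
    int divide = low;
    for (int comb = low + 1; comb <= high; comb += 1) {
        if (a[comb] < pivotValue) {
            divide += 1;
            swap(a, divide, comb);
        }
    }
    swap(a, divide, low);
    return divide;
}

/* Quicksort that leaves regions of at most CUTOFF elements unsorted. */
static void quickSortCoarse(int a[], int low, int high) {
    if (high - low + 1 <= CUTOFF) return;
    medianOfThreeToFront(a, low, high);
    int mid = partition(a, low, high);
    quickSortCoarse(a, low, mid - 1);
    quickSortCoarse(a, mid + 1, high);
}

/* Plain insertion sort on a[0..length-1], without pseudo-data. */
static void insertionSortFrom0(int a[], int length) {
    for (int i = 1; i < length; i += 1) {
        int value = a[i];
        int j = i - 1;
        while (j >= 0 && value < a[j]) {
            a[j + 1] = a[j];
            j -= 1;
        }
        a[j + 1] = value;
    }
}

void hybridQuickSort(int a[], int length) {
    assert(length >= 0);
    quickSortCoarse(a, 0, length - 1);
    insertionSortFrom0(a, length);  /* each element is near its place */
}
```

After the coarse pass, every element already lies within its own small region, so the final insertion sort never moves anything farther than CUTOFF - 1 places.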

30 Shell Sort (Donald Shell, 1959)


• Each pass has a span s.
    for (int span in reverse(spanSequence)) {
        for (int offset = 0; offset < span; offset += 1) {
            insertionSort(a[offset], a[offset+span], ... )
        } // each offset
    } // each span
• The last span used must be 1, so that the final pass is an ordinary insertion sort.
• Tokuda’s sequence: $s_0 = 1$; $s_k = 2.25 s_{k-1} + 1$; $\mathrm{span}_k = \lceil s_k \rceil$ = 1, 4, 9,
20, 46, 103, 233, 525, 1182, 2660, ...
• Experimental results for Shell Sort: compares + moves ≈ 2.2 n log n.

    n       compares   moves    n log n   n^2/2
    100     355        855      664       5000
    200     932        1932     1528      20000
    400     2266       4666     3457      80000
    800     5216       10816    7715      320000
    1600    11942      24742    17030     1280000
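The pseudocode above can be turned into C roughly as follows. This is a sketch rather than the code behind the table: it hard-codes the first ten Tokuda spans listed above, so it is only tuned for arrays of up to a few thousand elements, and it sorts a zero-based array without pseudo-data.

```c
#include <assert.h>

/* The first ten Tokuda spans, in ascending order. */
static const int spanSequence[] = {1, 4, 9, 20, 46, 103, 233, 525, 1182, 2660};
static const int spanCount = sizeof(spanSequence) / sizeof(spanSequence[0]);

void shellSort(int a[], int length) {
    assert(length >= 0);
    /* Use the spans from largest to smallest; the last one is 1. */
    for (int s = spanCount - 1; s >= 0; s -= 1) {
        int span = spanSequence[s];
        if (span >= length && span != 1) continue;  /* span too wide to matter */
        /* Insertion-sort each of the span interleaved subsequences
           a[offset], a[offset+span], a[offset+2*span], ... */
        for (int offset = 0; offset < span; offset += 1) {
            for (int i = offset + span; i < length; i += span) {
                int value = a[i];
                int j = i - span;
                while (j >= offset && value < a[j]) {
                    a[j + span] = a[j];
                    j -= span;
                }
                a[j + span] = value;
            }
        }
    }
}
```

The final pass with span 1 is an ordinary insertion sort, which guarantees a sorted result; the earlier passes only make that last pass cheap.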

31 Heaps: a kind of tree


• Class 9, 2/25/2021
• Heap property: the value at a node is ≤ the value of each child (for a
top-light heap) or ≥ the value of each child (for a top-heavy heap).
• The smallest (largest) value is therefore at the root.
• All leaves are at the same level ±1.

• To insert

• Place new value at “end” of tree.


• Let the new value sift up to its proper level.

• To delete: always delete the least (root) element

• Save value at root to return it later.


• Move the last value to the root.
• Let the new value sift down to its proper level.

• Storage

• Store the tree in an array [1 . . . ]


• leftChild[index] = 2*index
• rightChild[index] = 2*index+1
• the last occupied place in the array is at index heapSize.

• Applications

• Sorting
• Priority queue

// basic algorithms (top-light heap)
// swap(heap, i, j) exchanges heap[i] and heap[j].

void siftUp (int heap[], int subjectIndex) {
    // the element in subjectIndex needs to be sifted up.
    heap[0] = heap[subjectIndex]; // pseudo-data stops the loop at the root
    while (1) { // compare with parentValue.
        int parentIndex = subjectIndex / 2;
        if (heap[parentIndex] <= heap[subjectIndex]) return;
        swap(heap, subjectIndex, parentIndex);
        subjectIndex = parentIndex;
    }
} // siftUp

int betterChild (int heap[], int subjectIndex, int heapSize) {
    int answerIndex = subjectIndex * 2; // assume better child
    if (answerIndex+1 <= heapSize &&
            heap[answerIndex+1] < heap[answerIndex]) {
        answerIndex += 1;
    }
    return(answerIndex);
} // betterChild

void siftDown (int heap[], int subjectIndex, int heapSize) {
    // the element in subjectIndex needs to be sifted down.
    while (2*subjectIndex <= heapSize) {
        int childIndex = betterChild(heap, subjectIndex, heapSize);
        if (heap[childIndex] >= heap[subjectIndex]) return;
        swap(heap, subjectIndex, childIndex);
        subjectIndex = childIndex;
    }
} // siftDown

// intermediate algorithms

void insertInHeap (int heap[], int *heapSize, int value) {
    *heapSize += 1; // should check for overflow
    heap[*heapSize] = value;
    siftUp(heap, *heapSize);
} // insertInHeap

int deleteFromHeap (int heap[], int *heapSize) {
    int answer = heap[1];
    heap[1] = heap[*heapSize];
    *heapSize -= 1;
    siftDown(heap, 1, *heapSize);
    return(answer);
} // deleteFromHeap

// advanced algorithm

void heapSort(int array[], int arraySize){
    // sorts array[1..arraySize] by first making it a
    // top-heavy heap, then by successive deletion.
    // Deleted elements go to the end.
    // Needs top-heavy versions of betterChild and siftDown
    // (comparisons reversed); with the top-light versions above,
    // the result comes out in descending order.
    int index;
    // array[0] is reserved for pseudo-data; siftDown never reads it.
    // The second half of array[] satisfies the heap property.
    for (index = (arraySize+1)/2; index > 0; index -= 1) {
        siftDown(array, index, arraySize);
    }
    for (index = arraySize; index > 0; index -= 1) {
        array[index] = deleteFromHeap(array, &arraySize);
    }
} // heapSort

• This method of heapifying is O(n):
    • 1/2 the elements require no motion.
    • 1/4 the elements may sift down 1 level.
    • 1/8 the elements may sift down 2 levels.
    • Total motion = $(n/2) \cdot \sum_{1 \le j} j/2^j$
    • The sum approaches 2 as more terms are added, so the total motion approaches n.
• Total complexity is therefore O(n + n log n) = O(n log n).
• This sorting method is not stable, because sifting does not preserve order.

Experimental results for Heap Sort: compares + moves ≈ 3.1 n log n.

    n       compares   moves    n log n   n^2/2
    100     755        1190     664       5000
    200     1799       2756     1528      20000
    400     4180       6196     3457      80000
    800     9621       14050    7715      320000
    1600    21569      31214    17030     1280000
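For reference, here is a compact, self-contained heap sort that produces ascending order. It is a sketch written for this purpose, not the code measured above: it uses a top-heavy (max) heap in array[1..arraySize], and instead of calling a separate delete routine it swaps the root with the last heap element directly. The names siftDownMax and heapSortAscending are invented here.

```c
#include <assert.h>

static void swap(int a[], int i, int j) {
    int temp = a[i];
    a[i] = a[j];
    a[j] = temp;
}

/* Sift down in a top-heavy (max) heap stored in heap[1..heapSize]. */
static void siftDownMax(int heap[], int subjectIndex, int heapSize) {
    while (2 * subjectIndex <= heapSize) {
        int child = 2 * subjectIndex;
        if (child + 1 <= heapSize && heap[child + 1] > heap[child]) {
            child += 1;  /* the larger child is the better one */
        }
        if (heap[child] <= heap[subjectIndex]) return;
        swap(heap, subjectIndex, child);
        subjectIndex = child;
    }
}

/* Sort array[1..arraySize] into ascending order; array[0] is unused. */
void heapSortAscending(int array[], int arraySize) {
    assert(arraySize >= 0);
    /* Heapify bottom-up: O(n). */
    for (int index = arraySize / 2; index > 0; index -= 1) {
        siftDownMax(array, index, arraySize);
    }
    /* Repeatedly move the largest remaining value to the end: O(n log n). */
    for (int size = arraySize; size > 1; size -= 1) {
        swap(array, 1, size);
        siftDownMax(array, 1, size - 1);
    }
}
```

The array must have arraySize + 1 cells, because the heap arithmetic (children of index i at 2i and 2i+1) relies on one-based indexing.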

32 Bin sort
• Assumptions: values lie in a small range; there are no duplicates.
• Storage: build an array of bins, one for each possible value. Each is
1 bit long.
• Space: O(r), where r is the size of the range.
• Place each value to sort as a 1 in its bin. Time: O(n).
• Read off bins in order, reporting index if it is 1. Time: O(r).
• Total time: O(n + r).
• Total memory: O(r), which can be expensive.
• Can handle duplicates by storing a count in each bin, at a further
expense of memory.
• This sorting method does not work for arbitrary data having numeric
keys; it only sorts the keys, not the data.
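A minimal sketch of bin sort with one bit per bin follows. It assumes the values are distinct integers in the range 0 to range - 1; the function name binSort and its parameters are chosen here for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Sort distinct values in [0, range) in place, using one bit per
   possible value. */
void binSort(int values[], int count, int range) {
    size_t bytes = ((size_t) range + CHAR_BIT - 1) / CHAR_BIT;
    unsigned char *bins = calloc(bytes > 0 ? bytes : 1, 1);  /* all bits 0 */
    assert(bins != NULL);
    /* Place each value as a 1 in its bin: O(n). */
    for (int i = 0; i < count; i += 1) {
        assert(0 <= values[i] && values[i] < range);
        bins[values[i] / CHAR_BIT] |= (unsigned char) (1u << (values[i] % CHAR_BIT));
    }
    /* Read off the bins in order: O(r). */
    int out = 0;
    for (int v = 0; v < range; v += 1) {
        if (bins[v / CHAR_BIT] & (1u << (v % CHAR_BIT))) {
            values[out] = v;
            out += 1;
        }
    }
    assert(out == count);  /* fails if the input contained duplicates */
    free(bins);
}
```

The memory cost is range bits regardless of how few values are sorted, which is why the method only pays off when the range is small.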

33 Radix sort
• Example: use base 10, with values integers 0 – 9999, with 10 bins,
each holding a list of values, initially empty.
• Pass 1: insert each value in a bin (at rear of its list) based on the last
digit of the value.

• Pass 2: examine values in bin order, and in list order within bins,
placing them in a new copy of bins based on the second-to-last digit.
• Pass 3, 4: similar.
• The number of digits is O(log n), so there are O(log n) passes, each
of which takes O(n) time, so the algorithm is O(n log n).
• This sorting method is stable.
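The passes described above can be sketched in C as follows. One substitution should be noted: instead of ten explicit lists, this version counts how many values fall in each bin and computes where each bin starts in a scratch array. Values are still copied in their existing order within each bin, so the effect of each pass, including stability, is the same as appending to the rear of a list. The name radixSort and its digits parameter are chosen here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BASE 10

/* Sort non-negative values having at most the given number of
   base-10 digits, least significant digit first. */
void radixSort(int values[], int count, int digits) {
    int *scratch = malloc((size_t) (count > 0 ? count : 1) * sizeof(int));
    assert(scratch != NULL);
    int divisor = 1;
    for (int pass = 0; pass < digits; pass += 1) {
        int start[BASE + 1] = {0};
        /* Count the values in each bin (stored one slot to the right). */
        for (int i = 0; i < count; i += 1) {
            assert(values[i] >= 0);
            start[(values[i] / divisor) % BASE + 1] += 1;
        }
        /* Prefix sums: start[b] becomes the index where bin b begins. */
        for (int b = 0; b < BASE; b += 1) {
            start[b + 1] += start[b];
        }
        /* Distribute in the current order, which keeps the pass stable. */
        for (int i = 0; i < count; i += 1) {
            int bin = (values[i] / divisor) % BASE;
            scratch[start[bin]] = values[i];
            start[bin] += 1;
        }
        memcpy(values, scratch, (size_t) count * sizeof(int));
        divisor *= BASE;
    }
    free(scratch);
}
```

For the example in the notes (values 0 to 9999), digits is 4, so the whole sort is four linear passes.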

34 Merge sort

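#include <stdlib.h>
#include <string.h>

void merge(int array[], int lowIndex, int highIndex);

void mergeSort(int array[], int lowIndex, int highIndex){
    // sort array[lowIndex] .. array[highIndex]
    if (highIndex - lowIndex < 1) return; // width 0 or 1
    int mid = (lowIndex+highIndex)/2;
    mergeSort(array, lowIndex, mid);
    mergeSort(array, mid+1, highIndex);
    merge(array, lowIndex, highIndex);
} // mergeSort

void merge(int array[], int lowIndex, int highIndex) {
    // the notes give only an outline; this body is one possible completion
    int mid = (lowIndex+highIndex)/2;
    int width = highIndex - lowIndex + 1;
    // copy the relevant part of array to a temporary
    int *temp = (int *) malloc(width * sizeof(int));
    memcpy(temp, array + lowIndex, width * sizeof(int));
    // walk through the two halves of the temporary in tandem,
    // placing smaller in array, ties honor left version.
    int left = 0, right = mid - lowIndex + 1, out = lowIndex;
    while (left <= mid - lowIndex && right < width)
        array[out++] = (temp[right] < temp[left]) ? temp[right++] : temp[left++];
    while (left <= mid - lowIndex) array[out++] = temp[left++];
    while (right < width) array[out++] = temp[right++];
    free(temp);
} // merge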

• $c_n = n + 2c_{n/2}$
• a = 2, b = 2, k = 1 ⇒ O(n log n).
• This time complexity is guaranteed.
• Space needed: 2n, because merge in place is awkward (and expensive).
• The sort is also stable: it preserves the order of identical keys.
• Insertion, radix, and merge sort are stable, but not selection, Quicksort or Heapsort.

Experimental results for Merge Sort: compares + moves ≈ 2.9 n log n.

    n       compares   moves    n log n   n^2/2
    100     546        1344     664       5000
    200     1286       3088     1528      20000
    400     2959       6976     3457      80000
    800     6741       15552    7715      320000
    1600    15017      34304    17030     1280000

35 Red-black trees (Guibas and Sedgewick 1978)


• Class 10, 3/2/2021
• Red-black trees balance themselves during online insertion.
• Their representation requires pointers both to children and to the
parent.
• Each node is red or black.
• The pseudo-nodes (or null nodes) at bottom are black.
• The root node is black.
• Red nodes have only black children. So no path has two red nodes
in a row.
• All paths from the root to a leaf have the same number of black
nodes.
• For a node x, define black-height(x) = number of black nodes on a
path down from x, not counting x.
• The algorithm manages to keep height of the tree ≤ 2 log(n + 1).
• To keep the tree acceptable, we sometimes rotate, which reorganizes
the tree locally without changing the symmetric traversal.
[Figure: rotation. Left tree: root y with left child x (subtrees a, b) and right subtree c. Right tree: root x with left subtree a and right child y (subtrees b, c). A right rotation turns the first tree into the second; a left rotation turns the second back into the first.]
• To insert
    • place new node n in the tree and color it red. O(log n).
    • walk up the tree from n, rotating as needed to restore color rules. O(log n).
    • color the root black.
    • The cases, where c is the current node, p its parent, g its grandparent, and u its uncle:
        • case 1: parent and uncle red. Recolor; continue up the tree at g.
        • case 2: parent red, uncle black, c inside. Rotate c up and p down; continue to case 3 at p.
        • case 3: parent red, uncle black, c outside. Recolor; rotate p up and g down.
[Figure: before-and-after diagrams of the three cases. Circled nodes are black, others red; a star marks where to continue up the tree.]
• try with values 1..6:
[Figure: the tree after each insertion of 1 through 6, annotated with the case and the recoloring or rotation applied. The final tree has root 2 with children 1 and 4; 4 has children 3 and 5; 5 has right child 6.]
• try with these values: 5, 2, 7, 4 (case 1), 3 (case 2), 1 (case 1)
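The rotation step can be sketched in C as follows, using the x, y naming from the rotation diagram and a node type with parent pointers, as the representation requires. The type rbNode and the functions rotateRight and rotateLeft are illustrative names; the recoloring logic of the insertion cases is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

typedef struct rbNode_s {
    int data;
    int isRed;
    struct rbNode_s *left, *right, *parent;
} rbNode;

/* Rotate right around y: y's left child x moves up, and y becomes
   x's right child. *root is updated if y was the root. Returns x. */
rbNode *rotateRight(rbNode **root, rbNode *y) {
    rbNode *x = y->left;
    assert(x != NULL);
    y->left = x->right;                       /* subtree b moves under y */
    if (x->right != NULL) x->right->parent = y;
    x->parent = y->parent;
    if (y->parent == NULL) *root = x;
    else if (y->parent->left == y) y->parent->left = x;
    else y->parent->right = x;
    x->right = y;
    y->parent = x;
    return x;
}

/* Rotate left around x: the mirror image of rotateRight. Returns y. */
rbNode *rotateLeft(rbNode **root, rbNode *x) {
    rbNode *y = x->right;
    assert(y != NULL);
    x->right = y->left;                       /* subtree b moves under x */
    if (y->left != NULL) y->left->parent = x;
    y->parent = x->parent;
    if (x->parent == NULL) *root = y;
    else if (x->parent->left == x) x->parent->left = y;
    else x->parent->right = y;
    y->left = x;
    x->parent = y;
    return y;
}
```

Each rotation changes a constant number of pointers, and the symmetric (inorder) sequence a, x, b, y, c is the same before and after.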

36 Review of binary trees


• Binary trees have expected O(log n) depth, but they can have O(n)
depth.
• insertion
• traversal: preorder, postorder, inorder=symmetric order.
• deletion of node D

• If D is a leaf, remove it.


• If D has one child C, move C in place of D.
• If D has two children, find its successor S by going right once and then left as far as possible (S = RL*). Move S in place of D. S has no left child, but if it has a right child C, move C in place of S.
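The three deletion cases can be sketched in C as below. The type `tnode` and the functions `insertKey` and `deleteKey` are names invented for this sketch; `insertKey` is included only so the example can build a tree.

```c
#include <stdlib.h>

typedef struct tnode_s {
    int key;
    struct tnode_s *left, *right;
} tnode;

/* Ordinary unbalanced insertion, included only to build example trees. */
tnode *insertKey(tnode *root, int key) {
    if (root == NULL) {
        tnode *n = malloc(sizeof(tnode));
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key < root->key) root->left = insertKey(root->left, key);
    else root->right = insertKey(root->right, key);
    return root;
}

/* Delete one node holding key; returns the new root of this subtree. */
tnode *deleteKey(tnode *root, int key) {
    if (root == NULL) return NULL;                 /* key not present */
    if (key < root->key) { root->left = deleteKey(root->left, key); return root; }
    if (key > root->key) { root->right = deleteKey(root->right, key); return root; }
    /* root is the node D to delete */
    if (root->left == NULL || root->right == NULL) {
        /* leaf or one child: the child (possibly NULL) replaces D */
        tnode *child = root->left ? root->left : root->right;
        free(root);
        return child;
    }
    /* two children: successor S = right once, then left as far as possible */
    tnode *sParent = root, *s = root->right;
    while (s->left != NULL) { sParent = s; s = s->left; }
    /* S has no left child; its right child (if any) takes S's place */
    if (sParent == root) sParent->right = s->right;
    else sParent->left = s->right;
    /* S takes D's place */
    s->left = root->left;
    s->right = root->right;
    free(root);
    return s;
}
```

For instance, after inserting 5, 2, 8, 1, 3, 7, 9, 6 in that order, deleting 5 moves its successor 6 to the root.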

37 Ternary trees
• Class 11, 3/4/2021
• By example.
• The depth of a balanced ternary tree is $\log_3 n$, which is only 63% of the depth of a balanced binary tree.
• The number of comparisons needed to traverse an internal node during a search is either 1 or 2; average 5/3.
• So the number of comparisons to reach a leaf is $\frac{5}{3}\log_3 n$ instead of (for a binary tree) $\log_2 n$, a ratio of 1.05, indicating a 5% degradation.
• The situation gets only worse for larger arity. For quaternary trees,
the degradation (in comparison to binary trees) is about 12.5%.
• And, of course, an online construction is not balanced.
• Moral: binary is best; higher arity is not helpful.
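The two ratios quoted above can be checked directly. The ternary line uses the 5/3 average from the notes. The quaternary line assumes that a node's three values are compared in order, so the four children cost 1, 2, 3, and 3 comparisons (average 9/4); that cost model is an assumption of this worked example.

```latex
\frac{\tfrac{5}{3}\log_3 n}{\log_2 n}
  = \frac{5}{3}\cdot\frac{\ln 2}{\ln 3}
  \approx \frac{5}{3}\,(0.631) \approx 1.05
\qquad
\frac{\tfrac{9}{4}\log_4 n}{\log_2 n}
  = \frac{9}{4}\cdot\frac{1}{2} = 1.125
```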

38 Quad trees (Finkel 1973)


• Extension of sorted binary trees to two dimensions.
• Internal nodes contain a discriminant, which is a two-dimensional
(x,y) value.
• Internal nodes have four children, corresponding to the four quad-
rants from the discriminant.
• Leaf nodes contain a bucket of b values.
• Insertion

• Dive down the tree, put new value in its bucket.


• If the bucket overflows, pick a good discriminant and subdi-
vide.
• Good discriminant: one that separates the values as evenly as
possible. Suggestion: median (x, y) values.

• Offline algorithm to build a balanced tree

• Put all elements in a single bucket, then recursively subdivide as above.

• Generalization: for d-dimensional data, let each discriminant have d values. A node can have up to $2^d$ children. This number becomes cumbersome when d grows above about 3.
• Heavily used in 3-d modeling for graphics, often with discriminant chosen as midpoint, not median.

39 k-d trees (Bentley and Finkel 1973)


• Extension of sorted binary trees to d dimensions.
• Especially good when d is high.
• Internal nodes contain a dimension number (0 .. d − 1) and a dis-
criminant value (real).
• Internal nodes have two children, corresponding to values ≤ and >
the discriminant in the given dimension.
• Leaf nodes contain a bucket of b values.
• Offline construction and online insertion are similar to quad trees.

• To split a bucket of values, pick the dimension number with the largest range across those values.
• Given the dimension, pick the median of the values in that di-
mension as the discriminant.
• That choice of dimension number tends to make the domain
of each bucket roughly cubical; that choice of discriminant bal-
ances the tree.

• Nearest-neighbor search: Given a d-dimensional probe value p, to find the nearest neighbor to p that is in the tree.

• Dive into the tree until you find p’s bucket.


• Find the closest value in the bucket to p. Cost: b distance mea-
sures. Result: a ball around p.
• Walking back up to the root, starting at the bucket:
• If the domain of the other child of the node overlaps the
ball, dive into that child.
• If the ball is entirely contained within the node’s domain,
done.
• Otherwise walk one step up toward the root and continue.
• complexity: Initial dive is O(log n) in a balanced tree, and the expected number of buckets examined is O(1).

• Used for cluster analysis, categorizing (as in optical character recognition).
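The bucket-splitting rule (widest dimension, then median) might look like this in C. The names `point`, `DIMS`, `widestDimension`, and `chooseDiscriminant` are invented for this sketch, and the dimension count is fixed at 3 only to keep it short. It assumes the values in the chosen dimension are distinct; ties at the median would need extra care to respect the ≤ / > split.

```c
#include <stdlib.h>

#define DIMS 3                      /* d; fixed here for illustration only */

typedef struct { double coord[DIMS]; } point;

/* Returns the dimension in which the n points have the largest range. */
int widestDimension(const point *pts, int n) {
    int best = 0;
    double bestRange = -1.0;
    for (int dim = 0; dim < DIMS; dim++) {
        double lo = pts[0].coord[dim], hi = pts[0].coord[dim];
        for (int i = 1; i < n; i++) {
            if (pts[i].coord[dim] < lo) lo = pts[i].coord[dim];
            if (pts[i].coord[dim] > hi) hi = pts[i].coord[dim];
        }
        if (hi - lo > bestRange) { bestRange = hi - lo; best = dim; }
    }
    return best;
}

static int sortDim;                 /* dimension used by the comparison below */

static int byDim(const void *a, const void *b) {
    double x = ((const point *) a)->coord[sortDim];
    double y = ((const point *) b)->coord[sortDim];
    return (x > y) - (x < y);
}

/* Sorts the bucket along its widest dimension, stores that dimension in
   *dimOut, and returns the median value as the discriminant. Afterwards
   pts[0 .. (n-1)/2] belong to the "<=" child and the rest to the ">" child. */
double chooseDiscriminant(point *pts, int n, int *dimOut) {
    sortDim = widestDimension(pts, n);
    qsort(pts, n, sizeof(point), byDim);
    *dimOut = sortDim;
    return pts[(n - 1) / 2].coord[sortDim];
}
```

Sorting costs O(b log b) per split; a linear-time selection algorithm would also work, but sorting keeps the sketch simple.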

40 2-3 trees (John Hopcroft, 1970)


• Class 12, 3/9/2021
• By example.
• Like a ternary tree, but different rule of insertion
• Always completely balanced
• A node may hold 1, 2, or 3 (temporarily) values.
• A node may have 0 (only leaves), 2, 3, or 4 (temporarily) children.
• A node that has 3 values splits and promotes its middle value to its
parent (recursively up the tree).
• If the root splits, it promotes a new root.
• Complexity: O(log n) for each insertion and search, guaranteed (so O(n log n) to build a tree of n values).
• Deletion: unpleasant.

41 Stooge Sort
• A terrible method, but fun to analyze.

    #include <math.h>

    // exchange array[i] and array[j]
    void swap(int array[], int i, int j) {
        int temp = array[i];
        array[i] = array[j];
        array[j] = temp;
    } // swap

    void stoogeSort(int array[], int lowIndex, int highIndex) {
        // highIndex is one past the end
        int size = highIndex - lowIndex;
        if (size <= 1) { // nothing to do
        } else if (size == 2) { // direct sort
            if (array[lowIndex] > array[lowIndex+1]) {
                swap(array, lowIndex, lowIndex+1);
            }
        } else { // general case
            float third = ((float) size) / 3.0;
            // sort the first 2/3, then the last 2/3, then the first 2/3 again
            stoogeSort(array, lowIndex, (int) ceil(highIndex - third));
            stoogeSort(array, (int) floor(lowIndex + third), highIndex);
            stoogeSort(array, lowIndex, (int) ceil(highIndex - third));
        }
    } // stoogeSort

• $c_n = 1 + 3c_{2n/3}$
• $a = 3$, $b = 3/2$, $k = 0$, so $b^k = 1$. By the recursion theorem (page 18), since $a > b^k$, we have complexity $\Theta(n^{\log_b a}) = \Theta(n^{\log_{3/2} 3}) \approx \Theta(n^{2.71})$, so Stooge Sort is worse than quadratic.
• However, the recursion often encounters already-sorted sub-arrays.
If we add a check for that situation, Stooge Sort becomes roughly
quadratic.

42 B trees (Ed McCreight 1972)


• A generalization of 2-3 trees when McCreight was at Boeing, hence
the name.
• Choose a number m (the bucket size) such that m values plus m
disk indices fit in a single disk block. For instance, if a block is 4KB,
a value takes 4B, and an index takes 4B, then m = 4KB/8B = 512.
• m = 3 ⇒ 2-3 tree.
• Class 13, 3/11/2021
• Each node has 1 .. m − 1 values and 0 .. m children. (We have room
for m values; the extra can be used for pseudo-data.)
• Shorthand: $g = \lceil m/2 \rceil$ (the half size)
• Internal nodes (other than the root) have g .. m children.
• Insertion

• Insert in appropriate leaf.


• If current node overflows (has m values) split it into two nodes
of g values each; hoist the middle value up one level.
• When a node splits, its parent’s pointer to it becomes two point-
ers to the new nodes.
• When a value is hoisted, iterate up the tree checking for over-
flow.

• B+ tree variant: link leaf nodes together for quicker inorder traversal.
This link also allows us to avoid splitting a leaf if its neighbor is not
at capacity.
• A densely filled tree with n keys (values), height h:
• Number of nodes $a = 1 + m + m^2 + \cdots + m^h = \frac{m^{h+1} - 1}{m - 1}$.
• Number of keys $n = (m - 1)a = m^{h+1} - 1 \Rightarrow \log_m(n + 1) = h + 1 \Rightarrow h$ is $O(\log n)$.

• A sparsely filled tree with n keys (values), height h:

• The root has two subtrees; the others have $g = \lceil m/2 \rceil$ subtrees, so:
• Number of nodes $a = 1 + 2(1 + g + g^2 + \cdots + g^{h-1}) = 1 + \frac{2(g^h - 1)}{g - 1}$.
• The root has 1 key, the others have $g - 1$ keys, so:
• Number of keys $n = 1 + 2(g^h - 1) = 2g^h - 1 \Rightarrow h = \log_g \frac{n + 1}{2} = O(\log n)$.

43 Deletion from a B tree


• Deletion from an internal node: replace value with successor (taken
from a leaf), and then proceed to deletion from a leaf.
• Deletion from a leaf: the bad case is that it can cause underflow: the leaf now has fewer than g − 1 keys.
• In case of underflow, borrow a value from a neighbor if possible,
adjusting the appropriate key in the parent.
• If all neighbors (there are 1 or 2) are already minimal, grab a key from
the parent and also merge with a neighbor.
• In general, deletion is quite difficult.

44 Hashing
• Very popular data structure for searching.
• Cost of insertion and of search is O(log n), but only because n distinct
values must be log n bits long, and we need to look at the entire key.
If we consider looking at a key to be O(1), then hashing is expected
(but not guaranteed) to be O(1).
• Idea: find the value associated with key k at A[h(k)], where

• h() maps keys to integers in 0..s − 1, where s is the size of A[ ].


• h() is “fast”. (It generally needs to look at all of k, though.)

• Example

• k = student in class.
• h(k) = k’s birthday (a value from 0 .. 365).

• Difficulty: collisions
• Birthday paradox: Prob(no collisions with j people) = $\frac{365!}{(365 - j)!\,365^j}$
• This probability goes below 1/2 at j = 23.
• At j = 50, the probability is 0.029.

• Moral: One cannot in general avoid collisions. One has to deal with
them.
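The birthday probabilities above are easy to reproduce. This sketch multiplies the factors (365 − i)/365 one at a time rather than evaluating the factorials, which would overflow; the function name `probNoCollision` is made up for the example.

```c
/* Probability that j keys hashed uniformly into `days` slots are all
   distinct: the product of (days - i) / days for i = 0 .. j-1. */
double probNoCollision(int j, int days) {
    double p = 1.0;
    for (int i = 0; i < j; i++) {
        p *= (double) (days - i) / days;
    }
    return p;
}
```

Calling `probNoCollision(j, 365)` for j = 22, 23, and 50 shows the probability crossing 1/2 between 22 and 23 people and falling to about 0.03 at 50.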

45 Hashing: Dealing with collisions: open addressing
• Overview

• The following methods store all items in A[ ] and use a probe sequence. If the desired position is occupied, use some other position to consider instead.
• These methods suffer from clustering.
• Deletion is hard, because removing an element can damage unrelated searches. Deletion by marking is the only reasonable approach.

• Perfect hashing: if you know all n values in advance, you can look
for a non-colliding hash function h. Finding such a function is in
general quite difficult, but compiler writers do sometimes use perfect
hashing to detect keywords in the language (like if and for).
• Linear probing. Probe p is at index h(k) + p (mod s), for p = 0, 1, . . ..

• Terrible behavior when A[ ] is almost full, because chains coalesce. This problem is called “primary clustering”.

• Additional hash functions. Use a family of hash functions, $h_1()$, $h_2()$, ….

• insertion: keep probing with different functions until an empty slot is found.
• searching: probe with different functions until you find the key (success) or an empty slot (failure).
• You need a family of independent hash functions.
• The method is very expensive when A[ ] is almost full.
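As a concrete instance of open addressing, here is a minimal sketch of linear probing, where probe p is at index h(k) + p (mod s). The table size, the sentinel `EMPTY`, the stand-in hash `hashKey`, and the assumption that keys are non-negative ints are all choices made for this illustration. Deletion is omitted, since it would need the marking scheme mentioned above.

```c
#define TABLE_SIZE 11               /* s; small on purpose */
#define EMPTY (-1)                  /* assumes keys are non-negative */

typedef struct { int slot[TABLE_SIZE]; } probeTable;

void initTable(probeTable *t) {
    for (int i = 0; i < TABLE_SIZE; i++) t->slot[i] = EMPTY;
}

static int hashKey(int k) { return k % TABLE_SIZE; }   /* stand-in for h() */

/* Returns the index where k is stored, or -1 if the table is full. */
int insertLinear(probeTable *t, int k) {
    for (int p = 0; p < TABLE_SIZE; p++) {
        int index = (hashKey(k) + p) % TABLE_SIZE;
        if (t->slot[index] == EMPTY || t->slot[index] == k) {
            t->slot[index] = k;
            return index;
        }
    }
    return -1;
}

/* Returns the index holding k, or -1 if k is absent. */
int searchLinear(const probeTable *t, int k) {
    for (int p = 0; p < TABLE_SIZE; p++) {
        int index = (hashKey(k) + p) % TABLE_SIZE;
        if (t->slot[index] == k) return index;
        if (t->slot[index] == EMPTY) return -1;   /* empty slot ends the probe sequence */
    }
    return -1;
}
```

Inserting 3, 14, and 25 (all of which hash to 3) fills slots 3, 4, and 5; a later key that hashes to 4 must then walk past the whole run, which is the primary clustering described above.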

46 Review for midterm


Class 14, 3/16/2021
Insert the following items: 3₁ 1₁ 4 1₂ 5₁ 9 2 6 5₂ 3₂ into:
• binary tree. Preorder result: 3₁ 1₁ 1₂ 2 3₂ 4 5₁ 5₂ 9 6
• top-light heap. Breadth-order result: 1₁ 1₂ 2 3₁ 3₂ 9 4 6 5₂ 5₁
• array, then heapify. Breadth-order result: 1₁ 1₂ 2 3₁ 3₂ 9 4 6 5₂ 5₁
• ternary tree. Preorder result: (1₁, 3₁) 1₂ (2, 3₂) (4, 5₁) 5₂ (6, 9)
• array, then 5 steps of selection sort. Result: 1₁ 1₂ 2 3₁ 3₂ | 9 4 6 5₂ 5₁
Note: not stable.
• array, then 5 steps of insertion sort. Result: 1₁ 1₂ 3₁ 4 5₁ | 9 2 6 5₂ 3₂
Note: stable. Can force anti-stable.
• array, then first step of Quicksort, using Lomuto’s partitioning. final
insertionSort.
• 2-3 tree. Preorder result: 3 1 1 (2, 3) 5 (4, 5) (6, 9)


• red-black tree. Preorder result: 3b 1 1b 2b 3 5 4b 5 9b 6

47 Midterm exam
Class 15, 3/18/2021

48 Midterm exam follow-up


Class 16, 3/23/2021

49 Hashing: more open-addressing methods


• Class 17, 3/25/2021
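• Quadratic probing. Probe p is at index $h(k) + p^2 \pmod{s}$, for p = 0, 1, ….

• When does this sequence hit all of A[ ]? In general it does not. If s is prime, the first (s + 1)/2 probes land on distinct positions, so an empty slot is always found while the table is less than half full.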
• We still suffer “secondary clustering”: if two keys have the same
hash value, then the sequence of probes is the same for both.

• Add-the-hash rehash. Probe p is at index (p + 1) · h(k) (mod s).

• This method avoids clustering.


• Warning: h(k) must never be 0.

• Double hashing. Use two hash functions, $h_1()$ and $h_2()$. Probe p is at index $h_1(k) + p \cdot h_2(k) \pmod{s}$.

• This method avoids clustering.
• Warning: $h_2(k)$ must never be 0.
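A small sketch of the double-hashing probe sequence follows. The two hash functions used here (k mod s, and 1 + k mod (s − 1)) are stand-ins chosen for the example; the second is built so that it can never be 0. The sketch assumes s is prime, so every step size is relatively prime to s and the sequence visits every slot.

```c
/* Index of probe number p for key k in a table of size s (s prime).
   h1 and h2 are illustrative choices, not functions from the notes. */
unsigned doubleHashProbe(unsigned k, unsigned p, unsigned s) {
    unsigned h1 = k % s;
    unsigned h2 = 1 + k % (s - 1);      /* in 1 .. s-1, never 0 */
    return (h1 + p * h2) % s;
}
```

With s = 11 and k = 25, the step is 6 and the probes start 3, 9, 4, 10; two keys with the same h1 but different h2 follow different sequences, which is how secondary clustering is avoided.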

50 Hashing: Dealing with collisions: external chaining
• Each element in A is a pointer, initially null, to a bucket, which is a
linked list of nodes that hash to that element; each node contains k
and any other associated data.
• insert: place k at the front of A[h(k)].
• search: look through the list at A[h(k)].

• optimization: When you find a node, promote it to the start of its list.

• average list length is n/s. So if we set s ≈ n we expect about 1 element per list, although some may be longer, some empty.
• Instead of lists, we can use something fancier (such as 2-3 trees), but
it is generally better to use a larger s.
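External chaining, including the promote-on-find optimization, can be sketched as follows. The names `chainTable`, `chainInsert`, and `chainSearch`, the bucket count, and the multiply-by-31 string hash are all assumptions of this example. Keys are stored by pointer, so the caller must keep the strings alive.

```c
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 8               /* s; small on purpose */

typedef struct chainNode_s {
    const char *key;
    int value;
    struct chainNode_s *next;
} chainNode;

typedef struct { chainNode *bucket[NUM_BUCKETS]; } chainTable;

/* Stand-in hash function over the bytes of the key. */
static unsigned hashString(const char *k) {
    unsigned h = 0;
    while (*k) h = h * 31u + (unsigned char) *k++;
    return h % NUM_BUCKETS;
}

/* insert: place the new node at the front of its list */
void chainInsert(chainTable *t, const char *key, int value) {
    unsigned i = hashString(key);
    chainNode *n = malloc(sizeof(chainNode));
    n->key = key;
    n->value = value;
    n->next = t->bucket[i];
    t->bucket[i] = n;
}

/* search: walk the list; on success, promote the node to the front */
chainNode *chainSearch(chainTable *t, const char *key) {
    unsigned i = hashString(key);
    chainNode *prev = NULL;
    for (chainNode *n = t->bucket[i]; n != NULL; prev = n, n = n->next) {
        if (strcmp(n->key, key) == 0) {
            if (prev != NULL) {             /* not already first */
                prev->next = n->next;
                n->next = t->bucket[i];
                t->bucket[i] = n;
            }
            return n;
        }
    }
    return NULL;
}
```

A table is created with all buckets null, for example `chainTable t = {{NULL}};`.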

51 Hashing: What is a good hash function?


• Want it to be

• Uniform: Equally likely to give any value in 0..s − 1.


• Fast.
• Spreading: similar inputs → dissimilar outputs, to prevent clus-
tering. (Only important for open addressing, as described be-
low.)

• Several suggestions, assuming that k is a multi-word data structure, such as a string.

• Add (or multiply) all (or some of) the words of k, discarding overflow, then mod by s. It helps if $s = 2^j$, because mod is then masking with $2^j - 1$.
• XOR the words of k, shifting left by 1 after each, followed by mod s.

• Wisdom: The hash function doesn’t make much difference. It is not necessary to look at all of k. Just make sure that h(k) is not constant (except for testing collision resolution).
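The XOR-and-shift suggestion might be coded as below, treating each byte of a string as one "word". This is a sketch under that assumption, and the name `xorShiftHash` is invented. It shifts the accumulator before folding in each byte rather than after, so that the low bit of the result is not always 0.

```c
#include <stddef.h>

/* XOR the bytes of k together, shifting the accumulator left by 1
   between bytes (high bits that overflow are discarded), then mod s. */
unsigned xorShiftHash(const char *k, unsigned s) {
    unsigned h = 0;
    for (size_t i = 0; k[i] != '\0'; i++) {
        h = (h << 1) ^ (unsigned char) k[i];
    }
    return h % s;    /* when s is a power of 2, this equals h & (s - 1) */
}
```

The shift makes the hash depend on character order, so "ab" and "ba" land in different slots, which plain XOR or plain addition would not achieve.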

52 Hashing: How big should the array be?


• Some open-addressing methods prefer that s = ||Array|| be prime.
• Computing h() is faster if $s = 2^j$ for some j.
• Open addressing gets very bad if s < 2n, depending on method. Linear probing is the worst; I would make sure s ≥ 3n.
• External chaining works fine even when s ≈ n, but it gets steadily worse as n grows past s.

53 Hashing: What should we do if we discover that s is too small?
• We can rebuild with a bigger s, rehashing every element. But that
operation causes a temporary “outage”, so it is not acceptable for
online work.
• Extendible hashing

• Start with one bucket. If it gets too full (list longer than 10, say),
split it on the last bit of h(k) into two buckets.
• Whenever a bucket based on the last j bits is too full, split it
based on bit j + 1 from the end.
• To find the bucket
• compute v = h(k).
• follow a tree that discriminates on the last bits of v. This
tree is called a trie.
• it takes at most log v steps to find the right bucket.
• Searching within the bucket now is guaranteed to take con-
stant time (ignoring the log n cost of comparing keys)

54 Hash tables (associative arrays) in scripting languages
• Class 18, 3/30/2021
• Like an array, but the indices are strings.
• Resizing the array is automatic, although one might specify the ex-
pected size in advance to avoid resizing during early growth.
• Perl has a built-in datatype called a hash.

    my %foo;
    $foo{"this"} = "that";

• Python has dictionaries.

    Foo = dict()
    Foo['this'] = 'that'

• JavaScript objects (including arrays) can all be used as associative arrays.

    const foo = [];
    foo['this'] = 'that';
    foo.this = 'that';

55 Cryptographic hashes: digests


• purpose: uniquely identify text of any length.
• these hashes are not used for searching.
• goals

• fast computation
• non-invertible: given h(k), it should be infeasible to compute k.
• it should be infeasible to find collisions k1 and k2 such that h(k1 ) =
h(k2 ).

• examples

• MD5: 128 bits. Practical attack in 2008.
• SHA-1: 160 bits, but (2005) one can find collisions in $2^{69}$ hash operations (brute force would use $2^{80}$).
• SHA-2: usual variant is SHA-256; also SHA-512.

• uses
• storing passwords (used as a trap-door function)


• catching plagiarism
• for authentication (h(m + s) authenticates m to someone who
shares the secret s, for example)
• tripwire: intrusion detection

56 Graphs
• Our standard graph:
[Figure: our standard graph, an undirected graph with vertices 1 through 7 and edges e1 through e7.]

• Nomenclature

• vertices: V is the name of the set, v is the size of the set. In our
example, V = {1, 2, 3, 4, 5, 6, 7}.
• edges: E is the name of the set, e is the size of the set. In our
example, E = {e1, e2, e3, e4, e5, e6, e7}.
• directed graph: edges have direction (represented by arrows).
• undirected graph: edges have no direction.
• multigraph: more than one edge between two vertices. We gen-
erally do not deal with multigraphs, and the word graph gen-
erally disallows them.
• weighted graph: each edge has numeric label called its weight.

• Graphs represent situations

• streets in a city. We might be interested in computing paths.


• airline routes, where the weight is the price of a flight. We might
be interested in minimal-cost cycles.
• Hamiltonian cycle: no duplicated vertices (cities).
• Eulerian cycle: no duplicated edges (flights).
• Islands and bridges, as in the bridges of Königsberg, later called Kaliningrad (Euler 1707-1783). This is a multigraph, not strictly a graph.
[Figure: the land masses A, B, C, and D of Königsberg and the bridges joining them.]
Can you find an Eulerian cycle?

• Family trees. These graphs are bipartite: Family nodes and per-
son nodes. We might want to find the shortest path between
two people.
• Cities and roadways, with weights indicating distance. We might
want a minimal-cost spanning tree.

57 Data structures representing a graph


• Adjacency matrix

• an array n × n of Boolean.
• A[i, j] = true ⇒ there is an edge from vertex i to vertex j.
        1  2  3  4  5  6  7
    1      x           x
    2   x     x           x
    3      x              x
    4               x
    5            x
    6   x                 x
    7      x  x        x
• The array is symmetric if the graph is undirected
• in this case, we can store only one half of it, typically in a
1-dimensional array
• A[i(i − 1)/2 + j] holds information about edge i, j.
• Instead of Boolean, we can use integer values to store edge weights.
• Adjacency list

• an array n of singly-linked lists.


• j is in linked list A[i] if there is an edge from vertex i to vertex
j.
1 2→6
2 1→3→7
3 2→7
4 5
5 4
6 1→7
7 2→3→6

58 Computing the degree of all vertices


• Adjacency matrix: $O(v^2)$.

    foreach vertex (0 .. v-1) {
        degree[vertex] = 0;
        foreach neighbor (0 .. v-1) {
            if (A[vertex, neighbor]) degree[vertex] += 1;
        }
    }

• Adjacency list: O(v + e).

    foreach vertex (0 .. v-1) {
        degree[vertex] = 0;
        for (neighbor = A[vertex]; neighbor != null;
                neighbor = neighbor->next) {
            degree[vertex] += 1;
        }
    }

59 Computing the connected component containing vertex i in an undirected graph
• why: to segment an image.
• Class 19, 4/1/2021
• method: depth-first search (DFS).

    void DFS(vertex here) {
        // assume visited[*] == false at start
        visited[here] = true;
        foreach next (successors(here)) {
            if (! visited[next]) DFS(next);
        }
    } // DFS

• DFS is faster with adjacency list: $O(e' + v')$, where $e'$ and $v'$ count only the edges and vertices in the connected component.
• DFS is slower with adjacency matrix: $O(v \cdot v')$, because each visited vertex scans a full row of the matrix.
• For our standard graph (page 49), assuming that the adjacency lists
are all sorted by vertex number (or that we use the adjacency matrix),
starting at vertex 1, we invoke DFS on these vertices: 1, 2, 3, 7, 6.
• DFS can be coded iteratively with an explicit stack

    void DFS(vertex start) {
        // assume visited[*] == false at start
        workStack = makeEmptyStack();
        pushStack(workStack, start);
        while (! isEmptyStack(workStack)) {
            place = popStack(workStack);
            if (visited[place]) continue;
            visited[place] = true;
            foreach neighbor (successors(place)) {
                if (! visited[neighbor]) {
                    pushStack(workStack, neighbor);
                    // could record "place" as parent
                } // "neighbor" is not yet visited
            } // foreach neighbor
        } // while workStack not empty
    } // DFS

60 To see if a graph is connected


• See if DFS hits every vertex.
    bool isConnected() {
        foreach vertex (vertices)
            visited[vertex] = false;
        DFS(0); // or any vertex
        foreach vertex (vertices)
            if (! visited[vertex]) return false;
        return true;
    } // isConnected

61 Breadth-first search
• applications

• find shortest path in a family tree connecting two people


• find shortest route through city streets
• find fastest itinerary by plane between two cities

• method: place unfinished vertices in a queue. These are the ones we still need to visit, in order closest to furthest.

    void BFS(vertex start) {
        // assume visited[*] == false at start
        workQueue = makeQueue();
        visited[start] = true;
        insertInQueue(workQueue, start);
        while (! emptyQueue(workQueue)) {
            place = deleteFromQueue(workQueue); // from front
            foreach neighbor (successors(place)) {
                if (! visited[neighbor]) {
                    visited[neighbor] = true;
                    insertInQueue(workQueue, neighbor); // to rear
                    // or: insert (place, neighbor)
                    // to remember path to start
                } // not visited
            } // foreach neighbor
        } // while queue not empty
    } // BFS

• For our standard graph (page 49), assuming that the adjacency lists
are all sorted by vertex number (or that we use the adjacency matrix),
starting at vertex 1, BFS visits these vertices: 1, 2, 6, 3, 7.
• using adjacency lists, BFS is $O(v' + e')$.

62 Shortest path between vertices i and j


• Compute BFS(i), but stop when you visit j.

• Actually, you can stop when you place j in the queue.
• Construct the path by building a back chain when you insert a vertex in the queue. That is, you insert a pair: (place, neighbor).

• If edges are weighted:

• Use a heap (top-light) instead of a queue. That’s why heaps are sometimes called priority queues.
• stop when you visit j, not when you place j in the queue.

• Class 20, 4/6/2021



    void weightedBFS(vertex start, vertex goal) {
        // assume visited[*] == () at start
        workHeap = makeHeap(); // top-light
        insertInHeap(workHeap, (0, start, start));
            // distance, vertex, from where
        while (! emptyHeap(workHeap)) {
            (distance, place, from) = deleteFromHeap(workHeap);
            if (visited[place] != ()) continue; // already seen
            visited[place] = (from, distance);
            if (place == goal) return; // could print path
            foreach (neighbor, weight) in (successors(place)) {
                insertInHeap(workHeap, (distance+weight, neighbor, place));
            } // foreach neighbor
        } // while heap not empty
    } // weightedBFS

63 Dijkstra’s algorithm: Finding all shortest paths from a given vertex in a weighted graph
The weights must be positive. Weiss §9.3.2

• Rule: among all vertices that can extend a shortest path already
found, choose the one that results in a shortest path. If there is a
tie ending at the same vertex, choose either. If there is a tie going to
different vertices, choose both.
• This is an example of a greedy algorithm: at each step, improve the
solution in the way that looks best at the moment.
• Starting position: one path, length 0, from start vertex j to j.
[Figure: a weighted graph on vertices 1 through 6.]
Start at 5. An arrow marks each path as it is accepted.

    path    length
    5       → 0
    56      → 40
    53      120
    51      60
    51      → 60
    563     100    better way to add vertex 3
    564     160
    512     140
    563     → 100
    564     160
    512     140
    5634    → 120  better way to add vertex 4

• Another example:
[Figure: a weighted graph on vertices 1 through 5.]
Start at 1.

    path    length
    1       → 0
    12      → 3
    14      → 3
    123     → 4
    125     7
    143     5
    145     6
    125     7
    145     6
    1235    → 5
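For comparison with the hand traces, here is a minimal C sketch of Dijkstra's algorithm. To stay short it finds the closest unfinished vertex by scanning an array, which costs $O(v^2)$ overall, instead of using the heap that `weightedBFS` uses. The adjacency-matrix representation, the bound `MAX_V`, and the convention that weight 0 means "no edge" are assumptions of this sketch.

```c
#include <limits.h>

#define MAX_V 16
#define NO_EDGE 0      /* weights must be positive, so 0 can mean "no edge" */

/* weight[i][j] > 0 is the weight of the edge between i and j.
   On return, dist[i] is the length of the shortest path from start to i,
   or INT_MAX if i is unreachable. Vertices are numbered 0 .. v-1. */
void dijkstra(int v, int weight[MAX_V][MAX_V], int start, int dist[MAX_V]) {
    int done[MAX_V] = {0};
    for (int i = 0; i < v; i++) dist[i] = INT_MAX;
    dist[start] = 0;
    for (int round = 0; round < v; round++) {
        /* greedy step: the unfinished vertex with the shortest known path */
        int best = -1;
        for (int i = 0; i < v; i++) {
            if (!done[i] && dist[i] != INT_MAX
                    && (best < 0 || dist[i] < dist[best])) {
                best = i;
            }
        }
        if (best < 0) break;            /* the rest are unreachable */
        done[best] = 1;                 /* its path is now final */
        /* see whether best offers a better way to reach its neighbors */
        for (int j = 0; j < v; j++) {
            int w = weight[best][j];
            if (w != NO_EDGE && !done[j] && dist[best] + w < dist[j]) {
                dist[j] = dist[best] + w;
            }
        }
    }
}
```

Recording `best` as the predecessor of `j` whenever `dist[j]` improves would also recover the paths themselves, not just their lengths.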

64 Topological sort
• Sample application: course prerequisites place some pairs of courses
in order, leading to a directed, acyclic graph (DAG). We want to find
a total order; there may be many acceptable answers.
• Weiss §9.2
[Figure: a prerequisite DAG whose vertices, numbered 1 through 10, are the courses 115, 215, 216, 275, 280, 315, 335, 405, 470, and 471.]
Possible results:
1. 4 10 1 2 6 5 7 8 3 9
2. 10 4 7 1 2 5 8 6 3 9

• method: DFS looking for sinks (vertices with fanout 0), which are then placed at the front of the growing result.

    list answerList; // global

    void topologicalSort () { // computes answerList
        foreach j (vertices) visited[j] = false;
        answerList = makeEmptyList();
        foreach j (vertices)
            if (! visited[j]) tsRecurse(j);
    } // topologicalSort

    void tsRecurse(vertex here) { // adds to answerList
        visited[here] = true;
        foreach next (successors(here))
            if (! visited[next]) tsRecurse(next);
        insertAtFront(answerList, here);
    } // tsRecurse

65 Spanning trees
• Class 21, 4/8/2021
• Weiss §9.5
• Spanning tree: Given a connected undirected graph, a cycle-free connected subgraph containing all the original vertices.
[Figure: a weighted graph on vertices 1 through 6 (left) and one of its spanning trees (right).]
• Minimum-weight spanning tree: Given a connected undirected weighted graph, a spanning tree with least total weight.
• Example: minimum-cost set of roads (edges) connecting a set of cities
(vertices).

66 Prim’s algorithm

    Start with any vertex as the current tree.
    do v − 1 times
        connect the current tree to the closest external vertex

• This is a greedy algorithm: at each step, improve the solution in the way that looks best at the moment.
• Example: start with 5. We add: (5,6), (5,1), (1,3), (3,4), (1,2)
• Implementation

• Keep a top-light heap of all external vertices based on their distance to the current tree (and store to which tree vertex they connect at that distance).
• Initially, all distances are ∞ except for the neighbors of the start-
ing vertex.
• Repeatedly take the closest vertex f and add its edge to the cur-
rent tree.
• For all external neighbors b of f , perhaps f is a better way to
connect b to the tree; if so, update b’s information in the heap.
(Remove b and reinsert it with the better distance.)

• Complexity: O(v · log v + e), because we add each vertex once, removing it from a heap that can have v elements; we need to consider each edge twice (once from each end).

67 Kruskal’s algorithm

    Start with all vertices, no edges.
    do v − 1 times
        add the lowest-cost missing edge that does not form a cycle

• This is a greedy algorithm: at each step, improve the solution in the way that looks best at the moment.
• We can stop when we have added v − 1 edges; all the rest will certainly introduce cycles.
• Data representation: List of edges, sorted by weight
• Complexity: assuming that keeping track of the component of each vertex is $O(\log^* v)$, the complexity is $O(e \log e + v \log^* v)$, because we must sort the edges and then add v − 1 edges.

68 Cycle detection: Union-find


• general idea

• As edges are added, keep track of which connected component every vertex belongs to.
• Any new edge connecting vertices already in the same compo-
nent would form a cycle; avoid adding such edges.

• operations

• Each vertex starts as a separate component.


• union(b,c): assign b and c to the same component (for instance,
when an edge is introduced between them).
• find(b): tell which component b is in (if b and c are in the same
component, don’t add an edge connecting them).

• method for union(b,c)


• Every vertex has at most one parent, initially nil.


• Find the representative b’ of b by following parent links until
the end.
• Find the representative c’ of c.
• If b’ = c’, they are already in the same component. Done.
• Point either b’ to c’ or c’ to b’ by introducing a parent link be-
tween them.
• We want trees to be as shallow as possible. So record the height
of each tree in its root. Point the shallower one at the deeper
one.
• We can compress paths while searching for the representative.
In this case, the height recorded in the root is just an estimate.

• We use this data structure in Kruskal’s algorithm to avoid cycles:


    typedef struct vertex_s {
        int name; // need not be int
        struct vertex_s *representative; // NULL => me
        int depth; // only if I represent my group; 0 initially
    } vertex_t;

• Class 22, 4/13/2021


• More examples of Union-Find
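Union-find and its use in Kruskal's algorithm can be sketched together. To keep the example compact, this version stores parent links in arrays indexed by vertex number rather than in the `vertex_t` struct above; the names `ufInit`, `ufFind`, `ufUnion`, `edge`, and `kruskal` are invented for the sketch, and the graph is assumed to be connected.

```c
#include <stdlib.h>

#define MAX_V 64

static int parent[MAX_V];   /* -1 => this vertex represents its component */
static int height[MAX_V];   /* estimated tree height; used only at representatives */

/* Each vertex starts as a separate component. */
void ufInit(int v) {
    for (int i = 0; i < v; i++) { parent[i] = -1; height[i] = 0; }
}

/* find(b): follow parent links to the representative, compressing the path. */
int ufFind(int b) {
    int rep = b;
    while (parent[rep] != -1) rep = parent[rep];
    while (parent[b] != -1) {           /* point everything on the path at rep */
        int next = parent[b];
        parent[b] = rep;
        b = next;
    }
    return rep;
}

/* union(b,c): returns 1 if two components were merged, 0 if b and c
   were already together (so an edge b-c would form a cycle). */
int ufUnion(int b, int c) {
    int bRep = ufFind(b), cRep = ufFind(c);
    if (bRep == cRep) return 0;
    if (height[bRep] < height[cRep]) parent[bRep] = cRep;        /* shallower under deeper */
    else if (height[bRep] > height[cRep]) parent[cRep] = bRep;
    else { parent[cRep] = bRep; height[bRep] += 1; }
    return 1;
}

typedef struct { int from, to, weight; } edge;

static int byWeight(const void *a, const void *b) {
    return ((const edge *) a)->weight - ((const edge *) b)->weight;
}

/* Sorts the edges in place and returns the total weight of a
   minimum-weight spanning tree. */
int kruskal(int v, edge *edges, int e) {
    int total = 0, added = 0;
    qsort(edges, e, sizeof(edge), byWeight);
    ufInit(v);
    for (int i = 0; i < e && added < v - 1; i++) {
        if (ufUnion(edges[i].from, edges[i].to)) {   /* no cycle: keep this edge */
            total += edges[i].weight;
            added += 1;
        }
    }
    return total;
}
```

On a 4-cycle with weights 1, 2, 3, 4 plus a diagonal of weight 5, the sketch keeps the edges of weight 1, 2, and 3 and rejects the other two as cycle-forming.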

69 Numerical algorithms
• We will not look at algorithms for approximation to problems using
real numbers; that is the subject of CS321.
• We will study integer algorithms.

70 Euclidean algorithm: greatest common divisor (GCD)
• Examples: gcd(12,60)=12, gcd(15,66)=3, gcd(15,67)=1.
    int gcd(int a, int b) {
        while (b != 0) {
            int remainder = a % b; // (a, b) = (b, a % b)
            a = b;
            b = remainder;
        }
        return(a);
    } // gcd

• Example: gcd(12, 60)
    a   12  60  12
    b   60  12   0
• Example: gcd(15, 66)
    a   15  66  15   6   3
    b   66  15   6   3   0
• Example: gcd(15, 67)
    a   15  67  15   7   1
    b   67  15   7   1   0

71 Fast exponentiation
• Many cryptographic algorithms require raising large integers (thou-
sands of digits) to very large powers (hundreds of digits), modulo a
large number (about 2K bits).
• to get $a^{64}$ we only need six multiplications: $(((((a^2)^2)^2)^2)^2)^2$
• to get $a^5$ we need three multiplications: $a^4 \cdot a = (a^2)^2 \cdot a$.
• General rule to compute $a^e$: look at the binary representation of e, read it from left to right. The initial accumulator has value 1.

• 0: square the accumulator
• 1: square the accumulator and multiply by a.

• Example: $a^{11}$. In binary, $11_{10}$ is expressed as $1011_2$. So we get $((((1^2) \cdot a)^2)^2 \cdot a)^2 \cdot a$, a total of 4 squares and 3 multiplications, or 7 operations. The first square is always $1^2$ and the first multiplication is $1 \cdot a$; we can avoid those trivial operations.
• In cryptography, we often need to compute ae (mod p). Calculate
this quantity by performing mod p after each multiplication.
• As we read the binary representation of e from left to right, we could start with the leading 0’s without any harm.
• Example (run with the bc calculator program): $243^{745} \bmod 452$. $745_{10} = 1011101001_2$.

    a = 243
    m = 452
    r = 1
    r = r^2*a % m
    r = r^2 % m
    r = r^2*a % m
    r = r^2*a % m
    r = r^2*a % m
    r = r^2 % m
    r = r^2*a % m
    r = r^2 % m
    r = r^2 % m
    r = r^2*a % m
    r
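The same left-to-right rule can be written as a C function. This is a sketch for machine-sized integers only: it assumes the modulus is below $2^{32}$ so that the 64-bit products cannot overflow, whereas real cryptographic code would use a BigNum library. The name `modPow` is made up for the example.

```c
#include <stdint.h>

/* Computes a^e mod m by reading the bits of e from left to right.
   Assumes 0 < m < 2^32, so (m-1)*(m-1) fits in 64 bits. */
uint64_t modPow(uint64_t a, uint64_t e, uint64_t m) {
    uint64_t r = 1 % m;                 /* the accumulator starts at 1 */
    a %= m;
    for (int bit = 63; bit >= 0; bit--) {
        r = (r * r) % m;                /* every bit: square */
        if ((e >> bit) & 1) {
            r = (r * a) % m;            /* 1 bit: also multiply by a */
        }
    }
    return r;
}
```

The loop starts at the leading zero bits of e, which is harmless because squaring an accumulator of 1 leaves it at 1. `modPow(243, 745, 452)` performs the same sequence of squarings and multiplications as the bc session above.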

72 Integer multiplication
• Class 23, 4/15/2021
• The BigNum representation: linked list of pieces, each with, say, 2
bytes of unsigned integer, with least-significant piece first. (It makes
no difference whether we store those 2 bytes in little-endian or big-
endian.)
• Ordinary multiplication of two n-digit numbers x and y costs $n^2$.
• Anatoly Karatsuba (1962) showed a divide-and-conquer method that is better.
• Split each number into two chunks, each with n/2 digits:
• $x = a \cdot 10^{n/2} + b$
• $y = c \cdot 10^{n/2} + d$
The base 10 is arbitrary; the same idea works in any base, such as 2.
• Now we can calculate $xy = ac \cdot 10^n + (bc + ad) \cdot 10^{n/2} + bd$. This calculation uses four multiplications, each costing $(n/2)^2$, so it still costs $n^2$. All the additions and shifts (multiplying by powers of 10) cost just O(n), which we ignore.
• We can use the Recursion Theorem (page 17): $c_n = n + 4c_{n/2}$. Then $a = 4$, $b = 2$, $k = 1$, so $a > b^k$, so $c_n = \Theta(n^{\log_b a}) = \Theta(n^{\log_2 4}) = \Theta(n^2)$.
• But we can introduce $u = ac$, $v = bd$, and $w = (a + b)(c + d)$ at a cost of $(3/4)n^2$.
• Now $xy = u \cdot 10^n + (w - u - v) \cdot 10^{n/2} + v$, which costs no further multiplications.
• Example
• x = 3962, y = 4481
• a = 39, b = 62, c = 44, d = 81
• u = ac = 1716, v = bd = 5022, w = (a + b)(c + d) = 12625
• w − u − v = 5887
• xy = 17753722.
• In bc:
    x = 3962
    y = 4481
    a = 39
    b = 62
    c = 44
    d = 81
    u = a*c
    v = b*d
    w = (a+b)*(c+d)
    x * y
    u*10^4 + (w-u-v)*10^2 + v

• We can apply this construction recursively. $c_n = n + 3c_{n/2}$. We can again apply the Recursion Theorem (page 17): $a = 3$, $b = 2$, $k = 1$, so $a > b^k$, so $c_n = \Theta(n^{\log_b a}) = \Theta(n^{\log_2 3}) \approx \Theta(n^{1.585})$.
• For small n, this improvement is small. But for n = 100, we reduce the cost from 10,000 to about 1480. Running bc -l:

    power=l(3)/l(2)
    a=100
    e(power*l(a))

    bigInt bigMult(bigInt x, bigInt y, int n) {
        // n-chunk multiply of x and y
        bigInt a, b, c, d, u, v, w;
        if (n == 1) return(toBigInt(toInt(x)*toInt(y)));
        a = extractPart(x, 0, n/2 - 1); // high part of x
        b = extractPart(x, n/2, n-1); // low part of x
        c = extractPart(y, 0, n/2 - 1); // high part of y
        d = extractPart(y, n/2, n-1); // low part of y
        u = bigMult(a, c, n/2); // recursive
        v = bigMult(b, d, n/2); // recursive
        // recursive; assumes the sums a+b and c+d still fit in n/2 chunks
        w = bigMult(bigAdd(a,b), bigAdd(c,d), n/2);
        return(
            bigAdd(
                bigShift(u, n),
                bigAdd(
                    bigShift(bigSubtract(w, bigAdd(u,v)), n/2),
                    v
                ) // add
            ) // add
        );
    } // bigMult
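The recursion can be tried out on ordinary machine integers before building a BigNum package. The sketch below works on decimal digits of `uint64_t` values; the names `karatsuba` and `pow10u` are invented here, and it is only an illustration of the three-multiplication identity, valid while the product fits in 64 bits.

```c
#include <stdint.h>

/* 10^k for small k */
static uint64_t pow10u(int k) {
    uint64_t p = 1;
    while (k-- > 0) p *= 10;
    return p;
}

/* Multiplies x and y, treating each as an n-digit decimal number.
   Illustration only: the product must fit in 64 bits. */
uint64_t karatsuba(uint64_t x, uint64_t y, int n) {
    if (n <= 1) return x * y;                    /* base case: direct multiply */
    int half = n / 2;
    uint64_t shift = pow10u(half);
    uint64_t a = x / shift, b = x % shift;       /* x = a * 10^half + b */
    uint64_t c = y / shift, d = y % shift;       /* y = c * 10^half + d */
    uint64_t u = karatsuba(a, c, half);          /* ac */
    uint64_t v = karatsuba(b, d, half);          /* bd */
    uint64_t w = karatsuba(a + b, c + d, half);  /* (a+b)(c+d) */
    /* w - u - v = ad + bc, so only three recursive multiplications are needed */
    return u * shift * shift + (w - u - v) * shift + v;
}
```

Because the split is done arithmetically with `/` and `%`, the identity still holds when a + b or c + d carries into an extra digit; a chunk-based BigNum version has to handle that carry explicitly.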

73 Strings and pattern matching — Text search problem
• Class 24, 4/20/2021
• The problem: Find a match for pattern p within a text t, where |p| =
m and |t| = n.
• Application: t is a long string of bytes (a “message”), and p is a short
string of bytes (a “word”).
• We will look at several algorithms; there are others.

• Brute force: O(mn). Typical: 1.1n (operations).
• Rabin-Karp: O(n). Typical: 7n.
• Knuth-Morris-Pratt: O(n). Typical: 1.1n.
• Boyer-Moore: worst O(mn). Typical: n/m.

• Non-classical version: approximate match, regular expressions, more complicated patterns.

74 Text search — brute force algorithm


• Return the smallest index j such that t[j .. j + m − 1] = p, or −1 if
there is no match.
    #include <string.h>

    int bruteSearch(char *t, char *p) {
        // returns index in t where p is found, or -1
        // p must be writable: its terminator is temporarily overwritten
        const int n = strlen(t);
        const int m = strlen(p);
        int answer = -1; // failure unless we find a match
        int tIndex = 0;
        p[m] = 0xFF; // impossible character; pseudo-data
        while (tIndex+m <= n) { // there is still room to find p
            int pIndex = 0;
            while (t[tIndex+pIndex] == p[pIndex]) // enlarge match
                pIndex += 1;
            if (pIndex == m) { // hit pseudo-data
                answer = tIndex;
                break;
            }
            tIndex += 1;
        } // there is still room to find p
        p[m] = 0; // restore the terminator
        return(answer);
    } // bruteSearch

• Example: p = "001", t = "010001".


• Worst case: O((n − m)m) = O(nm)
• If the patterns are fairly random, we observe complexity O(n − m) =
O(n); in practice, complexity is about 1.1n.

75 Text search — Rabin-Karp


• Michael Rabin, Richard Karp (1987)
• The idea is to do a preliminary hash-based check each time we incre-
ment tIndex and skip this value of tIndex if there is no chance that
this position works.
• Problem: how can we avoid m accesses to compute the hash of the
next piece of t?

• We will start with fingerprinting, a weak version of the final method,
just looking at parity, and assuming the strings are composed of 0
and 1 characters.
• The parity of a string of 0 and 1 characters is 0 if the number of 1
characters is even; otherwise the parity is 1.
• Formula: $\mathit{parity} = \sum_j p[j] \pmod 2$.
• We can compute the parities of windows of m (= 6) bits in t. For example,
j        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
t        0  0  1  0  1  1  0  1  0  1  0  0  1  0  1  0  0  1  1
tParity  1  1  0  1  0  1  0  1  0  1  0  0  1  1
• Say that p = 010111, which has pParity = 0. We only need to
consider matches starting at positions 2, 4, 6, 8, 10, and 11.
• We have saved half the work.
• We can calculate tParity quickly as we move p by looking at only
2, not m, characters of t:
• Initially, $\mathit{tParity}_0 = \sum_{0 \le j < m} t[j] \pmod 2$.
• Then, $\mathit{tParity}_{j+1} = \mathit{tParity}_j + t[j] + t[j+m] \pmod 2$.



#include <string.h>

typedef char bit; // each bit is stored as a '0' or '1' character

bit computeParity(const bit *string, int length) {
    bit answer = 0;
    for (int index = 0; index < length; index += 1) {
        answer = (answer + string[index]) & 01; // only the low-order bit matters
    }
    return (answer);
} // computeParity

int fingerprintSearch(const bit *t, const bit *p) {
    const int n = (int) strlen(t);
    const int m = (int) strlen(p);
    if (m > n) return(-1); // p cannot fit in t
    const int pParity = computeParity(p, m);
    int tParity = computeParity(t, m); // initial substring
    int tIndex = 0;
    while (tIndex+m <= n) { // there is still room to find p
        if (tParity == pParity) { // parity check ok
            int pIndex = 0;
            while (t[tIndex+pIndex] == p[pIndex]) { // enlarge match
                pIndex += 1;
                if (pIndex >= m) return(tIndex);
            } // enlarge match
        } // parity check ok
        tParity = (tParity + t[tIndex] + t[tIndex+m]) & 01;
        tIndex += 1;
    } // there is still room to find p
    return(-1); // failure
} // fingerprintSearch

• Instead of bits, we can deal with character arrays.
• We generalize parity to the exclusive OR of characters, which
are just 8-bit quantities.
• The C operator for exclusive OR is ^.
• The update rule for tParity is
tParity = tParity ^ t[tIndex] ^ t[tIndex+m];
• We now have reduced the work to 1/128 (for 7-bit ASCII), not
1/2, for the random case, because only that small fraction of
starting positions are worth pursuing.
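A minimal sketch of the exclusive-OR fingerprint for ordinary C strings follows. The function name xorFingerprintSearch and the use of memcmp for the confirming comparison are assumptions made for illustration.

```c
#include <string.h>

/* Returns the index of the first occurrence of p in t, or -1.
   The fingerprint of a window is the XOR of its m characters. */
int xorFingerprintSearch(const char *t, const char *p) {
    const int n = (int) strlen(t);
    const int m = (int) strlen(p);
    if (m == 0) return 0;
    if (m > n) return -1;
    unsigned char pFinger = 0;
    unsigned char tFinger = 0;
    for (int i = 0; i < m; i += 1) {
        pFinger ^= (unsigned char) p[i];
        tFinger ^= (unsigned char) t[i];
    }
    for (int tIndex = 0; tIndex + m <= n; tIndex += 1) {
        /* compare characters only when the fingerprints agree */
        if (tFinger == pFinger && memcmp(t + tIndex, p, (size_t) m) == 0) {
            return tIndex;
        }
        /* slide the window: XOR out t[tIndex], XOR in t[tIndex+m];
           at the last window t[tIndex+m] is the terminating 0, which is harmless */
        tFinger ^= (unsigned char) t[tIndex];
        tFinger ^= (unsigned char) t[tIndex + m];
    }
    return -1;
}
```

Equal fingerprints do not prove a match (any permutation of the same characters has the same XOR), which is why the sketch still compares the characters.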

• The full algorithm extends fingerprinting.
• Instead of reducing the work to 1/2 or 1/128, we want to reduce
it to 1/q for some large q.
• Use this hash function for m bytes t[j] . . . t[j + m − 1]:
$\sum_{0 \le i < m} 2^{m-1-i}\, t[j+i] \pmod q$. Experience suggests that q should
be a prime > m.
• We can still update tParity quickly as we move p by looking
at only 2, not m, characters of t:
$\mathit{tParity}_{j+1} = (t[j+m] + 2(\mathit{tParity}_j - 2^{m-1}\, t[j])) \pmod q$.
• We can use shifting to compute tParity without multiplication:
$\mathit{tParity}_{j+1} = (t[j+m] + ((\mathit{tParity}_j - (t[j] \ll (m-1))) \ll 1)) \pmod q$.
We still need to compute mod q, however.
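The hash and its update rule can be sketched as follows. This is only an illustration under stated assumptions: the modulus 9,999,991 is an arbitrary example prime, the name rabinKarpSearch is hypothetical, and each hash hit is confirmed by comparing characters.

```c
#include <stdint.h>
#include <string.h>

/* Returns the index of the first occurrence of p in t, or -1.
   hash(window) = sum over i of 2^(m-1-i) * window[i], taken mod q. */
int rabinKarpSearch(const char *t, const char *p) {
    const int64_t q = 9999991;   /* example prime modulus (an assumption) */
    const int n = (int) strlen(t);
    const int m = (int) strlen(p);
    if (m == 0) return 0;
    if (m > n) return -1;
    int64_t highWeight = 1;      /* 2^(m-1) mod q */
    for (int i = 1; i < m; i += 1) highWeight = (highWeight * 2) % q;
    int64_t pHash = 0;
    int64_t tHash = 0;
    for (int i = 0; i < m; i += 1) {   /* Horner's rule builds both hashes */
        pHash = (2 * pHash + (unsigned char) p[i]) % q;
        tHash = (2 * tHash + (unsigned char) t[i]) % q;
    }
    for (int tIndex = 0; tIndex + m <= n; tIndex += 1) {
        if (tHash == pHash && memcmp(t + tIndex, p, (size_t) m) == 0) {
            return tIndex;
        }
        if (tIndex + m < n) {
            /* remove the leading character, double, add the new trailing one */
            const int64_t dropped = (highWeight * (unsigned char) t[tIndex]) % q;
            tHash = (2 * ((tHash - dropped + q) % q)
                     + (unsigned char) t[tIndex + m]) % q;
        }
    }
    return -1;
}
```

Adding q before the inner reduction keeps the intermediate value nonnegative, since the C remainder operator can return a negative result for a negative operand.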

• Class 25, 4/22/2021


• Monte-Carlo substring search

• Choose a prime q close to but not exceeding $mn^2$. For instance,
if m = 10 and n = 1000, choose a prime q near $10^7$, such
as 9,999,991.
• The probability 1/q that we will make a mistake is very low, so
just omit the inner loop. We will sometimes have a false posi-
tive, with probability, it turns out, less than 2.53/n.
• I don’t think we save enough computation to warrant using
Monte Carlo search. If false positives are very rare, it doesn’t
hurt to employ even a very expensive algorithm to remove them.
Checking anyway is called the “Las-Vegas version”.

• The idea is good, but in practice Rabin-Karp takes about 7n work.

76 Text search — Knuth–Morris–Pratt


• Donald Knuth, James Morris, Vaughan Pratt, 1970-1977.
• Consider t = Tweedledee and Tweedledum, p = Tweedledum.
• After running the inner loop of brute-force search to the u in p, we
have learned much about t, enough to realize that none of the letters
up to that point in t (except the first) are T. So the next place to start
a match in t is not position 1, but position 8.

• Consider t = pappappappar, p = pappar.


• After running the inner loop of brute-force search to the r in p, we
have learned much about t, enough to realize that the first place in t
that can match p starts not at position 1, but rather in position 3 (the
third p). Moving p to that position lets us continue in the middle of
p, never retreating in t at all.
• How much to shift p depends on how much of it matches when we
encounter a mismatch in the inner loop. This shift table describes
the first example.
p            T   w   e   e   d   l   e   d   u   m
k       -1   0   1   2   3   4   5   6   7   8   9
shift    1   1   2   3   4   5   6   7   8   9  10
• If our match fails at p[8], use shift[7]=8 to reposition the pattern.
• Here is the shift table for the second example.
p            p   a   p   p   a   r
k       -1   0   1   2   3   4   5
shift    1   1   2   2   3   3   6
• Try matching that p against t = pappappapparrassanuaragh.
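#include <string.h>

void computeShiftTable(const char *p, int shiftTable[]); // fills entries k = 0 .. m-1

int KMPSearch(const char *t, const char *p) {
    const int n = (int) strlen(t);
    const int m = (int) strlen(p);
    if (m == 0) return(0); // the empty pattern matches at once
    int tIndex = 0;
    int pIndex = 0;
    int shiftTable[m]; // int, because shifts can be as large as m
    computeShiftTable(p, shiftTable);
    while (tIndex+m <= n) { // there is still room to find p
        while (t[tIndex+pIndex] == p[pIndex]) { // enlarge match
            pIndex += 1;
            if (pIndex >= m) return(tIndex);
        } // enlarge match
        // a mismatch at p[0] corresponds to k = -1, whose shift is 1
        const int shiftAmount = (pIndex == 0) ? 1 : shiftTable[pIndex - 1];
        tIndex += shiftAmount;
        pIndex = (pIndex > shiftAmount) ? pIndex - shiftAmount : 0;
    } // there is still room to find p
    return(-1); // failure
} // KMPSearch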

• Unfortunately, computing the shift table, although O(m), is not
straightforward, so we omit it.


• The overall cost is guaranteed O(n + m), but m < n, so O(n). In
practice, it makes about 1.1n comparisons.
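Although the notes omit the shift-table computation, one possible sketch is given here for reference. It assumes the usual border (prefix-function) construction: after j characters have matched, the shift is j minus the length of the longest proper prefix of p[0..j-1] that is also its suffix. The names buildShift and kmpSketchSearch are hypothetical, and the table is indexed by the number of matched characters j = k + 1 rather than by k.

```c
#include <stdlib.h>
#include <string.h>

/* Fills shift[0..m]; shift[j] is the slide after exactly j matched characters.
   The array first holds border lengths and is then converted in place. */
void buildShift(const char *p, int m, int *shift) {
    int len = 0;
    shift[0] = 0;
    if (m >= 1) shift[1] = 0;
    for (int j = 2; j <= m; j += 1) {
        /* extend the border of p[0..j-2] by the character p[j-1] */
        while (len > 0 && p[j - 1] != p[len]) len = shift[len];
        if (p[j - 1] == p[len]) len += 1;
        shift[j] = len;
    }
    shift[0] = 1;  /* corresponds to k = -1 in the tables above */
    for (int j = 1; j <= m; j += 1) shift[j] = j - shift[j];
}

/* Returns the index of the first occurrence of p in t, or -1
   (also -1 if memory for the table cannot be allocated). */
int kmpSketchSearch(const char *t, const char *p) {
    const int n = (int) strlen(t);
    const int m = (int) strlen(p);
    if (m == 0) return 0;
    if (m > n) return -1;
    int *shift = malloc((size_t) (m + 1) * sizeof *shift);
    if (shift == NULL) return -1;
    buildShift(p, m, shift);
    int tIndex = 0;
    int pIndex = 0;
    int answer = -1;
    while (tIndex + m <= n) {
        while (pIndex < m && t[tIndex + pIndex] == p[pIndex]) pIndex += 1;
        if (pIndex == m) { answer = tIndex; break; }
        const int shiftAmount = shift[pIndex];
        tIndex += shiftAmount;
        /* keep the part of the match that survives the slide */
        pIndex = (pIndex > shiftAmount) ? pIndex - shiftAmount : 0;
    }
    free(shift);
    return answer;
}
```

For p = pappar the sketch produces the shifts 1, 1, 2, 2, 3, 3, 6, which agrees with the second table above.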

77 Text search — Boyer – Moore simple


• Robert S. Boyer, J. Strother Moore (1977)
• We start by modifying bruteSearch to search from the end of p
backwards.
#include <string.h>

int backwardSearch(const char *t, const char *p) {
    const int n = (int) strlen(t);
    const int m = (int) strlen(p);
    if (m == 0) return(0); // the empty pattern matches at once
    int tIndex = 0;
    while (tIndex+m <= n) { // there is still room to find p
        int pIndex = m-1; // start at the end of p
        while (t[tIndex+pIndex] == p[pIndex]) { // enlarge match
            pIndex -= 1;
            if (pIndex < 0) return(tIndex);
        } // enlarge match
        tIndex += 1;
    } // there is still room to find p
    return(-1); // failure
} // backwardSearch

• Occurrence heuristic: At a mismatch, say at letter α in t, shift p to
align the rightmost occurrence of α in p with that α in the text. But
don’t move p to the left. If α does not occur at all in p, move p to one
position after α.
• Method: Initialize location array for p:

#include <string.h>

int location[256];
// location[c] is the last position in p holding char c, or -1 if c is absent

void initLocation(const char *p) {
    const int m = (int) strlen(p);
    for (int charVal = 0; charVal < 256; charVal += 1) {
        location[charVal] = -1;
    }
    for (int pIndex = 0; pIndex < m; pIndex += 1) {
        // the cast keeps the index nonnegative when char is signed
        location[(unsigned char) p[pIndex]] = pIndex;
    }
} // initLocation

• Let α be the failure character, which is found at a particular pIndex
and tIndex.
• Slide p: tIndex += max(1, pIndex - location[α])
• This formula works in all cases.
• α not in p and pIndex = m-1 ⇒ a full shift: tIndex += m
• α not in p and pIndex = j ⇒ a partial shift, larger if we haven’t
travelled far along p: tIndex += pIndex + 1
• α is in p. We shift enough to align the rightmost α of p with the
one we failed on, or at least shift right by 1.

• Examples

• p = rum, t = conundrum. We shift p by 3, another 3, and find the
match.
• p = drum, t = conundrum. We shift p by 1, by 4, and find the
match.
• p = natu, t = conundrum. We shift p by 2, then by 4, and run out of
room, so the search fails.
• p = date, t = detective. The formula would shift p left, so we just shift
right 1, then 4, then 3, then fail.
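The examples above can be reproduced with a short sketch that combines the backward search with the occurrence heuristic. The function name occurrenceSearch is hypothetical, and the location table is kept local here so the block stands alone.

```c
#include <string.h>

/* Backward search using only the occurrence heuristic.
   Returns the index of the first occurrence of p in t, or -1. */
int occurrenceSearch(const char *t, const char *p) {
    const int n = (int) strlen(t);
    const int m = (int) strlen(p);
    if (m == 0) return 0;
    int location[256];  /* last position of each character in p, or -1 */
    for (int c = 0; c < 256; c += 1) location[c] = -1;
    for (int i = 0; i < m; i += 1) location[(unsigned char) p[i]] = i;
    int tIndex = 0;
    while (tIndex + m <= n) {
        int pIndex = m - 1;
        while (pIndex >= 0 && t[tIndex + pIndex] == p[pIndex]) pIndex -= 1;
        if (pIndex < 0) return tIndex;
        /* alpha is the text character that caused the mismatch */
        const int alpha = (unsigned char) t[tIndex + pIndex];
        const int slide = pIndex - location[alpha];
        tIndex += (slide > 1) ? slide : 1;  /* never move p to the left */
    }
    return -1;
}
```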

• Class 26, 4/27/2021


• Match heuristic: Use a shift table (organized for right-to-left search)
as with the Knuth–Morris–Pratt algorithm.
• Use both the occurrence and the match heuristics, and shift by the
larger of the two suggestions.

• Horspool’s version (Nigel Horspool, 1980): on a mismatch, look at
β, which is the element in t where we started matching, that is,
$\beta = t_{i+m-1}$. Shift so that β in t aligns with the rightmost occurrence of β
in p (not counting $p_{m-1}$).
• This method always shifts p to the right.
• We need to precompute for each letter of the alphabet where its
rightmost occurrence in p is, not counting $p_{m-1}$. In particular:
• shift[β] = if β in $p_{0..m-2}$ then $m - 1 - \max\{j \mid j < m-1,\ p_j = \beta\}$
else m.
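A minimal sketch of Horspool's version follows, using the shift rule just stated. The function name horspoolSearch is an assumption made for illustration.

```c
#include <string.h>

/* Horspool search: the shift depends only on the text character aligned
   with the last position of p. Returns the match index or -1. */
int horspoolSearch(const char *t, const char *p) {
    const int n = (int) strlen(t);
    const int m = (int) strlen(p);
    if (m == 0) return 0;
    int shift[256];
    for (int c = 0; c < 256; c += 1) shift[c] = m;  /* character not in p[0..m-2] */
    for (int j = 0; j < m - 1; j += 1) {
        /* later j overwrite earlier ones, so the rightmost occurrence wins */
        shift[(unsigned char) p[j]] = m - 1 - j;
    }
    int tIndex = 0;
    while (tIndex + m <= n) {
        int pIndex = m - 1;
        while (pIndex >= 0 && t[tIndex + pIndex] == p[pIndex]) pIndex -= 1;
        if (pIndex < 0) return tIndex;
        /* beta is the text character where matching started */
        tIndex += shift[(unsigned char) t[tIndex + m - 1]];
    }
    return -1;
}
```

Every entry of the table is at least 1, which is why this version always moves p to the right.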

78 Advanced pattern matching, as in Perl


• Based on regular expressions; can be compiled into finite-state au-
tomata.
• exact: conundrum
• don’t-care symbols: con.ndr..
• character classes: c[ou1-5]nundrum
• alternation: c(o|u)nund(rum|ite)
• repetition:
• c(on)*und
• c(on)+und
• c(on){4,5}und
• predefined character classes: c\wnundrum\d\W
• Unicode character classes:
c\p{ASCII}nundrum\p{digit}\p{Final_Punctuation}
• pseudo-characters: ^conundrum$
• Beyond regular expressions in Perl
• Reference to "capture groups": con(un|an)dr\1m
• Zero-width assertions: (?=conundrum)
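Some of these patterns can be tried from C. The sketch below assumes a POSIX system that provides <regex.h> and uses POSIX extended regular expressions, which cover the don't-care symbol, character classes, alternation, repetition, and the ^ and $ anchors, but not the Perl-specific items such as \p{...} or zero-width assertions. The helper name matchesPattern is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if text contains a match for the POSIX extended regular
   expression, 0 if it does not, and -1 if the pattern fails to compile. */
int matchesPattern(const char *pattern, const char *text) {
    regex_t compiled;
    if (regcomp(&compiled, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    const int status = regexec(&compiled, text, 0, NULL, 0);
    regfree(&compiled);
    return (status == 0) ? 1 : 0;
}
```

For instance, matchesPattern("c(o|u)nund(rum|ite)", "cunundite") should report a match, while matchesPattern("c(on)+und", "cund") should not.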

79 Edit distance
• How much do we need to change s (source) to make it look like d
(destination) ?

• Charge 1 for each replacement (R), deletion (D), insertion (I).
• Example: ghost →(D) host →(I) houst →(R) house
• The edit distance (s,d) is the smallest number of operations to
transform s to d.
• We can build an edit-distance table d by this rule:
$d_{i,j} = \min(d_{i-1,j} + 1,\ d_{i,j-1} + 1,\ d_{i-1,j-1} + (\text{if } s[i] = d[j] \text{ then } 0 \text{ else } 1))$.
• Example: peseta → presto (should get distance 3).
          -1   0   1   2   3   4   5
               p   e   s   e   t   a
  -1       0   1   2   3   4   5   6
   0   p   1   0   1   2   3   4   5
   1   r   2   1   1   2   3   4   5
   2   e   3   2   1   2   2   3   4
   3   s   4   3   2   1   2   3   4
   4   t   5   4   3   2   2   2   3
   5   o   6   5   4   3   3   3   3
• We can trace back from the last cell to see exactly how to navigate to
the start cell: pick any smallest neighbor to left/above.
• ↓: delete a character from source (left string)
• →: insert a character from destination (top string)
• ↘: keep the same character (if number the same) or replace a
character in the source (left string) with one from the destination
(top string).
• complexity: O(nm) to calculate the array; the preprocessing is just to
start up the array, of cost O(n + m).
• Another example: convert banana to antenna. It should take only
4 edits.
• Class 27, 4/29/2021
• This algorithm is in the dynamic programming category. Pascal’s
triangle is another, as is finding the rectangle in an array with the
largest sum of values (some negative).
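The table-building rule can be sketched in C as follows. This is an illustrative version, not necessarily the one used in class: the name editDistance is an assumption, the table is stored row by row in one allocated block, and index 0 of the table plays the role of row and column −1 above.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the edit distance between s and d, or -1 if memory runs out.
   table[i][j] = distance between the first i chars of s and first j of d. */
int editDistance(const char *s, const char *d) {
    const int rows = (int) strlen(s) + 1;
    const int cols = (int) strlen(d) + 1;
    int *table = malloc((size_t) rows * (size_t) cols * sizeof *table);
    if (table == NULL) return -1;
    for (int i = 0; i < rows; i += 1) table[i * cols] = i;  /* i deletions */
    for (int j = 0; j < cols; j += 1) table[j] = j;         /* j insertions */
    for (int i = 1; i < rows; i += 1) {
        for (int j = 1; j < cols; j += 1) {
            const int del = table[(i - 1) * cols + j] + 1;
            const int ins = table[i * cols + (j - 1)] + 1;
            const int rep = table[(i - 1) * cols + (j - 1)]
                            + ((s[i - 1] == d[j - 1]) ? 0 : 1);
            int best = (del < ins) ? del : ins;
            if (rep < best) best = rep;
            table[i * cols + j] = best;
        }
    }
    const int answer = table[rows * cols - 1];  /* bottom-right cell */
    free(table);
    return answer;
}
```

With this sketch, peseta and presto are at distance 3, and banana and antenna are at distance 4, matching the examples above.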

80 Categories of algorithms
• Divide and conquer

• Greedy
• Dynamic programming
• Search

81 Divide and conquer algorithms


• steps
• if the problem size n is trivial, do it.
• divide the problem into a easier problems of size n/b.
• do the a easier problems
• combine the answers.
• We can usually compute the complexity by the Recursion Theorem
(page 17).
• cost: $n^k$ for splitting, recomputing, so $C_n = n^k + a\,C_{n/b}$.
• Select jth smallest element of an array. a = 1, b = 2, k = 1 ⇒ O(n).
• Quicksort. a = 2, b = 2, k = 1 ⇒ O(n log n).
• Binary search, search in a binary tree. a = 1, b = 2, k = 0 ⇒ O(log n).
• Multiplication (Karatsuba) a = 3, b = 2, k = 1 ⇒ $O(n^{\log_2 3})$.
• Tile an n × n board that is missing a single cell by a trimino: a = 4,
b = 4, k = 0 ⇒ O(n).

• Mergesort:
void mergeSort(int array[], int lowIndex, int highIndex){
    // sort array[lowIndex] .. array[highIndex]
    if (highIndex - lowIndex < 1) return; // width 0 or 1
    int mid = (lowIndex+highIndex)/2;
    mergeSort(array, lowIndex, mid);
    mergeSort(array, mid+1, highIndex);
    merge(array, lowIndex, highIndex); // combine the two sorted halves
} // mergeSort
a = 2, b = 2, k = 1 ⇒ O(n log n).
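The merge step is not shown above. A complete sketch, assuming a scratch array of the same size as the input and hypothetical names mergeHalves, sortRange, and mergeSortSketch, might look like this:

```c
#include <stdlib.h>
#include <string.h>

/* Merge sorted array[low..mid] and array[mid+1..high] via scratch. */
static void mergeHalves(int array[], int low, int mid, int high, int scratch[]) {
    int left = low;
    int right = mid + 1;
    int out = low;
    while (left <= mid && right <= high) {
        if (array[left] <= array[right]) scratch[out++] = array[left++];
        else scratch[out++] = array[right++];
    }
    while (left <= mid) scratch[out++] = array[left++];
    while (right <= high) scratch[out++] = array[right++];
    memcpy(array + low, scratch + low, (size_t) (high - low + 1) * sizeof(int));
}

static void sortRange(int array[], int low, int high, int scratch[]) {
    if (high - low < 1) return;               /* width 0 or 1 */
    const int mid = low + (high - low) / 2;
    sortRange(array, low, mid, scratch);      /* a = 2 subproblems ... */
    sortRange(array, mid + 1, high, scratch); /* ... each of size n/2 */
    mergeHalves(array, low, mid, high, scratch);  /* linear work, k = 1 */
}

/* Sorts array[0..count-1] in increasing order; does nothing if the
   scratch array cannot be allocated (a simplification for this sketch). */
void mergeSortSketch(int array[], int count) {
    if (count < 2) return;
    int *scratch = malloc((size_t) count * sizeof *scratch);
    if (scratch == NULL) return;
    sortRange(array, 0, count - 1, scratch);
    free(scratch);
}
```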

82 Greedy algorithms
General rule: Enlarge the current solution by selecting (usually in a simple
way) the best single-step improvement.

• Computing the coins for change: greedily apply the biggest available
coin first.

• not always optimal: consider denominations 1, 6, 10, and we
wish to construct 12. Greedy picks 10 + 1 + 1 (three coins), but
6 + 6 uses only two.
• Denominations 1, 5, 10 guarantee optimality.
• Power-of-two coins would be very nice: no more than 1 of each
needed for change. British measures follow this rule: fluid ounce
: tablespoon : quarter-gill : half-gill : gill : cup : pint : quart :
half gallon : gallon
• Similar problem: putting weights on barbells.
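The greedy coin rule is short enough to sketch directly. The function name greedyCoinCount is hypothetical, and the sketch assumes the denominations are listed from largest to smallest.

```c
/* Greedy change-making: use as many of the largest coin as possible,
   then move to the next smaller one. denominations[] must be in
   decreasing order. Returns the number of coins used, or -1 if the
   amount cannot be paid exactly this way. */
int greedyCoinCount(const int denominations[], int count, int amount) {
    int coins = 0;
    for (int i = 0; i < count; i += 1) {
        coins += amount / denominations[i];
        amount %= denominations[i];
    }
    return (amount == 0) ? coins : -1;
}
```

With denominations 10, 6, 1 and amount 12, the sketch uses three coins, which is the non-optimal answer mentioned above.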

• Kruskal’s algorithm for computing a minimum-cost spanning tree:
greedily add edges of increasing weight, avoiding cycles.
• Prim’s algorithm for computing a minimum-cost spanning tree: greed-
ily enlarge the current spanning tree with the shortest edge leading
out.
• Dijkstra’s algorithm for all shortest paths from a source: greedily
pick the cheapest extension of all paths so far.
• Huffman codes for data compression

• Start with a table of frequencies, like this one:


space 60
A 22
O 16
R 13
S 6
T 4
A text containing all these characters in the given frequencies
would take 60 + 22 + 16 + 13 + 6 + 4 = 121 7-bit units or 847 bits.
• Build a table of codes, like this one:

space 0
A 111
O 110
R 101
S 1001
T 1000
The same text now uses 60 · 1 + 22 · 3 + . . . + 4 · 4 = 253 bits.
• To decode: follow a tree (each branch is labeled with the next code bit):
root
  0: space
  1:
    1:
      1: A
      0: O
    0:
      1: R
      0:
        1: S
        0: T

• To build the tree
• Each character is a node.
• Greedily take the two least common nodes, combine them as
children of a new parent, and label that parent with the
combined frequency of the two children.
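The greedy construction also gives the total encoded length without building the tree explicitly: every merge adds one bit to each character below the new parent, so the total is the sum of all parent labels. The sketch below assumes that observation and uses a simple quadratic search for the two smallest nodes; the name huffmanCost is hypothetical.

```c
#include <stddef.h>

/* Returns the total number of bits of a Huffman encoding for the given
   frequencies. The weights array is overwritten. Requires count >= 1. */
long huffmanCost(long weights[], size_t count) {
    long total = 0;
    while (count > 1) {
        /* find the indices of the two smallest weights */
        size_t lo = 0;
        size_t next = 1;
        if (weights[next] < weights[lo]) { lo = 1; next = 0; }
        for (size_t i = 2; i < count; i += 1) {
            if (weights[i] < weights[lo]) { next = lo; lo = i; }
            else if (weights[i] < weights[next]) { next = i; }
        }
        const long merged = weights[lo] + weights[next];
        total += merged;                     /* one more bit for every leaf below */
        weights[lo] = merged;                /* the parent replaces one child */
        weights[next] = weights[count - 1];  /* the last node fills the other slot */
        count -= 1;
    }
    return total;
}
```

For the frequency table above (60, 22, 16, 13, 6, 4) the sketch returns 253, the same count obtained from the code table.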

• Adding a million real numbers, all in the range 0 . . . 1, losing minimal
precision

• Remove the two smallest numbers from the set. (This step is
greedy: take the numbers whose sum can be computed with
the least precision loss.)
• Insert their sum in the set.
• Use a heap to represent the set.

• Continuous knapsack problem
• Given a set of n objects $x_i$, each with a weight $w_i$ and profit $p_i$,
and a total weight capacity C, select objects (to put in a knapsack)
that together weigh ≤ C and maximize profit. We are
allowed to take fractions of an object.

• Greedy method
• Start with an empty knapsack.
• Sort the objects in decreasing order of $p_i / w_i$.
• Greedy step: Take all of each object in the list, if it fits. If it
fits partially, take a fraction of the object, then done.
• Stop when the knapsack is full.
• Example.
chapter   pages (weight)   importance (profit)   ratio
1         120              5                     .0417
2         150              5                     .0333
3         200              4                     .0200
4         150              8                     .0533
5         140              3                     .0214
sorted: 4, 1, 2, 5, 3. If capacity C = 600, take all of 4, 1, 2, 5, and
40/200 of 3.
• This greedy algorithm happens to be optimal.
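The greedy method for the continuous knapsack can be sketched with the standard library qsort. The type name item and the function names below are assumptions made for illustration.

```c
#include <stdlib.h>

typedef struct {
    double weight;
    double profit;
} item;

/* qsort comparator: larger profit/weight ratio comes first */
static int byRatioDescending(const void *a, const void *b) {
    const item *x = a;
    const item *y = b;
    const double rx = x->profit / x->weight;
    const double ry = y->profit / y->weight;
    return (rx < ry) - (rx > ry);
}

/* Returns the best total profit; items[] is reordered by ratio. */
double fractionalKnapsack(item items[], size_t count, double capacity) {
    qsort(items, count, sizeof(item), byRatioDescending);
    double profit = 0.0;
    for (size_t i = 0; i < count && capacity > 0.0; i += 1) {
        if (items[i].weight <= capacity) {     /* the whole object fits */
            profit += items[i].profit;
            capacity -= items[i].weight;
        } else {                               /* take a fraction, then done */
            profit += items[i].profit * (capacity / items[i].weight);
            capacity = 0.0;
        }
    }
    return profit;
}
```

On the chapter example with C = 600, the sketch takes chapters 4, 1, 2, and 5 whole plus 40/200 of chapter 3, for a total profit of 8 + 5 + 5 + 3 + 0.8 = 21.8.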
• 0/1 knapsack problem: Same as before, but no fractions are allowed.
The greedy algorithm is still fast, but it is not guaranteed optimal.

83 Dynamic programming
General rule: Solve all smaller problems and use their solutions to com-
pute the solution to the next problem.
• Compute Fibonacci numbers: $f_i = f_{i-1} + f_{i-2}$.
• Compute binomial coefficients: C(n, i) = C(n − 1, i − 1) + C(n − 1, i).
• Compute minimal edit distance.
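Both recurrences can be evaluated bottom-up, keeping only the smaller answers that are still needed. The sketch below assumes the base cases f_0 = 0 and f_1 = 1, which the notes do not state, and limits binomial coefficients to n ≤ 64 so that the results fit in 64 bits; the function names are illustrative.

```c
#include <stdint.h>

/* Fibonacci by dynamic programming: remember only the last two values.
   Assumes f_0 = 0 and f_1 = 1. */
uint64_t fibonacci(int i) {
    if (i <= 0) return 0;
    uint64_t previous = 0;
    uint64_t current = 1;
    for (int step = 2; step <= i; step += 1) {
        const uint64_t next = previous + current;
        previous = current;
        current = next;
    }
    return current;
}

/* Binomial coefficient via Pascal's triangle, one row updated in place.
   Returns 0 for arguments outside 0 <= i <= n <= 64. */
uint64_t binomial(int n, int i) {
    uint64_t row[65] = {0};
    if (i < 0 || i > n || n > 64) return 0;
    row[0] = 1;
    for (int r = 1; r <= n; r += 1) {
        /* go right to left so row[c-1] still holds the previous row */
        for (int c = r; c >= 1; c -= 1) {
            row[c] += row[c - 1];
        }
    }
    return row[i];
}
```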

84 Summary of algorithms covered


• Class 28, 5/4/2021
• Graphs
• Computing the degree of all vertices: Loop over representation
• Computing the connected component containing node i in an
undirected graph: Depth-first search, recursive, avoiding ver-
tices already visited

• Breadth-first search: use a queue, iterative, avoiding vertices
already visited, perhaps with back-pointers
• Shortest path between nodes i and j: use a priority queue (heap)
sorted by distance from i.
• Topological sort: recursive; build list as the last step.
• Dijkstra’s algorithm: Finding all shortest paths from a given node.
Greedy: extend the currently shortest path
• Prim’s algorithm for spanning trees: Greedy, repeatedly add
shortest outgoing edge
• Kruskal’s algorithm for spanning trees: Greedy, repeatedly add
shortest edge that does not build a cycle.
• Cycle detection: Union-find: All vertices in a component point
(possibly indirectly) to a representative; union joins representa-
tives.

• Numerical algorithms

• Euclidean algorithm for greatest common divisor (GCD):
Repeatedly take modulus.
• Fast exponentiation: represent the exponent in binary to guide
the steps
• Integer multiplication (Karatsuba): Subdivide a problem of size
n × n into three problems of size n/2 × n/2.

• Strings and pattern matching — Text search problem

• Text search — Brute-force algorithm: Try each position for the
pattern p.
• Text search — Rabin-Karp: Hash-based pre-check each time p
moves over.
• Text search — Knuth–Morris–Pratt: Precomputed shift table tells
how far to move p on a mismatch.
• Text search — Boyer – Moore simple: Match starting at the end
of p; can jump great distances.
• Edit distance — Dynamic programming, finding edit distance
of several subproblems to guide the next subproblem.

• Miscellaneous

• Tiling (divide and conquer)



• Mergesort (divide and conquer)


• Computing Fibonacci numbers (dynamic programming)
• Computing binomial coefficients (dynamic programming)
• Continuous knapsack problem (greedy)
• Coin changing (greedy)
• Huffman codes (greedy)

85 Tractability
• Formal definition of O: f (n) = O(g(n)) iff for adequately large n,
and some constant c, we have f (n) ≤ c · g(n). That is, f is bounded
above by some multiple of g. We can say that f grows no faster than
g.
• Formal definition of Θ: f (n) = Θ(g(n)) iff for adequately large n, and
some constants c1 , c2 , we have c1 · g(n) ≤ f (n) ≤ c2 · g(n). We say that
f grows as fast as g.
• Formal definition of Ω is similar; f (n) = Ω(g(n)) means that f grows
at least as fast as g.
• We usually say that a problem is tractable if we can solve it in poly-
nomial time (with respect to the problem size n). We also say the
program is efficient.
• constant time: O(1)
• logarithmic time: O(log n)
• linear time: O(n)
• sub-quadratic: O(n log n) (for instance)
• quadratic time: $O(n^2)$
• cubic time: $O(n^3)$
• These are all bounded by $O(n^k)$ for some fixed k.
• However, if k is large, even tractable problems can be infeasible
to solve. In practice, algorithms seldom have k > 3.
• There are many algorithms that take more than polynomial time.
• exponential: $O(2^n)$.
• super-exponential: O(n!) (for example)
• $O(n^n)$
• $O(2^{2^n})$

86 Decision problems, function problems, P, NP


• Decision problems: The answer is just “yes” or “no”.
• Primality: is n prime? (There are very fast probabilistic algo-
rithms, and recently a polynomial algorithm).
• Is there a path from a to b shorter than 10?
• Are two graphs G and F isomorphic? (Apparently very hard)
• Can graph G be colored with 3 colors? (Apparently very hard)
• Function problems: the answer is a number.
• What is the smallest prime divisor of n?
• What is the weight of a minimum-weight spanning tree of G?
• We use P to refer to the set of decision problems that can be decided
in polynomial time. That is, for all problems p ∈ P, there must be
an algorithm and a positive number k such that the time of the
algorithm for p on input x is $O(|x|^k)$, where |x| means the size of x.
• We use NP to refer to the set of decision problems that can be decided
in polynomial time if we are allowed to guess a witness to a “yes”
answer and only need to check it.
• Is there a path from a to b shorter than 10? Guess the path, find
its length. O(1).
• Are two graphs G and F isomorphic? Guess the isomorphism,
then check, requiring O(v + e).
• Can graph G be colored with 3 colors? Guess the coloring, then
demonstrate that it is right; O(v + e).
• Is there a set of Boolean values for variables x1 , ..., xn that sat-
isfies a given Boolean formula (using ”and”, ”or”, and ”not”)?
Guess the values, check in linear time (in the length of the for-
mula).
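Checking a guessed witness is usually simple. As an illustration, the sketch below checks a proposed 3-coloring in time proportional to the number of vertices plus the number of edges. The edge-list representation and the names edge and isProperThreeColoring are assumptions; colors are taken to be the integers 0, 1, 2.

```c
#include <stddef.h>

typedef struct {
    int from;
    int to;
} edge;

/* Returns 1 if colors[] (one entry per vertex, each 0, 1, or 2) gives
   different colors to the two ends of every edge, and 0 otherwise. */
int isProperThreeColoring(const edge edges[], size_t edgeCount,
                          const int colors[], size_t vertexCount) {
    for (size_t v = 0; v < vertexCount; v += 1) {
        if (colors[v] < 0 || colors[v] > 2) return 0;  /* not one of 3 colors */
    }
    for (size_t e = 0; e < edgeCount; e += 1) {
        if (colors[edges[e].from] == colors[edges[e].to]) return 0;
    }
    return 1;
}
```

The hard part of the problem is finding the coloring; the check itself is a single pass over the vertices and edges.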
• Properties of P and NP (and EXP, decision problems that can be
solved in $O(k^n)$).
• P ⊆ NP ⊆ EXP
• P ⊂ EXP
• if a problem in NP has g possible witnesses, then it has an algo-
rithm in O(g n ).

• Some problems can be proved to be “hardest” in NP. They are
called NP-complete problems. All other problems in NP can
be reduced to such NP-complete problems.
• Nobody knows, but people suspect that P ⊂ NP ⊂ EXP.
