
COA Mod 3


KTU - CST202 - Computer Organization and Architecture Module: 3

Module: 3
ARITHMETIC ALGORITHMS - Algorithms for multiplication and division (restoring method) of binary numbers, Array multiplier, Booth's multiplication algorithm.
Pipelining - Basic principles, classification of pipeline processors, instruction and arithmetic pipelines (design examples not required), hazard detection and resolution.

MULTIPLICATION OF UNSIGNED NUMBERS

The product of two n-bit numbers is at most a 2n-bit number. Unsigned multiplication can be
viewed as the addition of shifted versions of the multiplicand. Multiplication involves the generation
of partial products, one for each digit in the multiplier. When the multiplier bit is 0, the partial
product is 0. When the multiplier bit is 1, the partial product is the multiplicand. Each successive
partial product is shifted one position to the left relative to the preceding partial product, and the
partial products are then summed to produce the final product.
For example, multiplying the integers 13 (1101) and 11 (1011) gives 143 (10001111).
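The partial-product view described above can be sketched in a few lines of Python. This is only an illustration, not a hardware description; the function name partial_products and its parameters are chosen here for the example.

```python
def partial_products(multiplicand: int, multiplier: int, n: int) -> list[int]:
    """Return the n shifted partial products of an unsigned n-bit multiplication."""
    products = []
    for i in range(n):
        bit = (multiplier >> i) & 1  # multiplier bit i, starting from the LSB
        # The partial product is either 0 or the multiplicand shifted left i places.
        products.append((multiplicand << i) if bit else 0)
    return products


pps = partial_products(13, 11, 4)
for i, pp in enumerate(pps):
    print(f"q{i}: {pp:08b}")
print(f"sum: {sum(pps):08b} = {sum(pps)}")  # 10001111 = 143
```

Summing the four partial products reproduces the product of 13 and 11.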

Array Multiplier
Binary multiplication can be implemented in a combinational two-dimensional logic array
called an array multiplier.
• The main component in each cell is a full adder, FA.
• The AND gate in each cell determines whether a multiplicand bit, mj, is added to the
incoming partial product bit, based on the value of the multiplier bit, qi.
• Each row i, where 0 <= i <= 3, adds the multiplicand (appropriately shifted) to the
incoming partial product, PPi, to generate the outgoing partial product, PP(i+1), if
qi = 1.
• If qi = 0, PPi is passed vertically downward unchanged. PP0 is all 0s and PP4 is the
desired product. The multiplicand is shifted left one position per row by the diagonal
signal path.

Figure: (a) Array multiplication of positive binary operands; (b) Multiplier cell

Disadvantages:
(1) An n-bit by n-bit array multiplier requires n^2 AND gates, n(n-2) full adders and n
half adders. (Half adders are used where there are 2 inputs and full adders where there
are 3 inputs.)
(2) The longest path from input to output passes through n adders in the top row, n-1
adders in the bottom row and n-3 adders in the middle rows. The longest path in a
circuit is called the critical path.
Sequential Circuit Multiplier
Multiplication is performed as a series of n conditional addition and shift operations: if the
given bit of the multiplier is 0, only a shift operation is performed, while if the given bit of
the multiplier is 1, an addition of the multiplicand to the partial product and a shift operation
are performed.
The combinational array multiplier uses a large number of logic gates for multiplying
numbers. Multiplication of two n-bit numbers can also be performed in a sequential circuit that
uses a single n-bit adder.
The block diagram in the figure shows the hardware arrangement for sequential
multiplication. This circuit performs multiplication by using a single n-bit adder n times to
implement the spatial addition performed by the n rows of ripple-carry adders in the array
multiplier. Registers A and Q are shift registers, concatenated as shown. Together, they hold
partial product PPi while multiplier bit qi generates the signal Add/Noadd. This signal causes
the multiplexer MUX to select 0 when qi = 0, or to select the multiplicand M when qi = 1, to be
added to PPi to generate PP(i+1). The product is computed in n cycles. The partial product grows
in length by one bit per cycle from the initial vector, PP0, of n 0s in register A. The carry-out
from the adder is stored in flip-flop C, shown at the left end of register A.

Algorithm:
(1) The multiplier and the multiplicand are loaded into registers Q and M respectively.
Register A and flip-flop C are cleared to 0.
(2) In each cycle, the LSB of the multiplier, q0, is examined:
(a) If q0 = 1, the control sequencer generates the Add signal, which adds the
multiplicand M to register A; the sum is stored in A and the carry-out in C.
(b) If q0 = 0, it generates the Noadd signal, which leaves register A unchanged.
(3) Registers C, A and Q are shifted right by 1 bit.
Steps (2) and (3) are repeated n times. The 2n-bit product is then held in A and Q.
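The register-level steps above can be mimicked in software. The following is a minimal sketch, assuming unsigned n-bit operands; the variable names a, q, m and c mirror registers A, Q, M and flip-flop C, and the function name is chosen only for this example.

```python
def sequential_multiply(multiplicand: int, multiplier: int, n: int) -> int:
    """Shift-and-add multiplication of two unsigned n-bit numbers."""
    mask = (1 << n) - 1
    m = multiplicand & mask  # register M
    q = multiplier & mask    # register Q
    a = 0                    # register A
    c = 0                    # carry flip-flop C
    for _ in range(n):
        if q & 1:                    # Add: LSB of Q is 1
            total = a + m
            c = (total >> n) & 1     # carry-out of the n-bit adder
            a = total & mask
        # Noadd: A is left unchanged.
        # Shift C, A and Q right by one bit as a single combined register.
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (c << (n - 1))
        c = 0
    return (a << n) | q  # the product sits in A (high half) and Q (low half)


print(sequential_multiply(13, 11, 4))  # 143
```

Only one n-bit addition is performed per cycle, which is the point of the sequential design.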

MULTIPLICATION OF SIGNED NUMBERS

We now discuss multiplication of 2's-complement operands, generating a double-length
product. The general strategy is still to accumulate partial products by adding versions of the
multiplicand as selected by the multiplier bits.
First, consider the case of a positive multiplier and a negative multiplicand. When we
add a negative multiplicand to a partial product, we must extend the sign-bit value of the
multiplicand to the left as far as the product will extend. The figure shows an example in which
a 5-bit signed operand, -13, is the multiplicand. It is multiplied by +11 to get the 10-bit product,
-143. The sign extension of the multiplicand is shown in blue in the figure. The hardware
discussed earlier can be used for negative multiplicands if it is augmented to provide for sign
extension of the partial products.
13 -> 1101
+13 -> 01101 [for a positive number, prefix a 0 as the sign bit]
-13 -> 10010 + 1 = 10011 [for a negative number, take the 2's complement: invert the bits and add 1]


[To extend the sign bit: since the operands are 5-bit signed numbers, a 10-bit product must be
generated. If the MSB of a partial product is 1, it is extended to the left with 1s;
if the MSB is 0, it is extended to the left with 0s.]

Example: Sign extension of negative multiplicand

[The product is 10 bits (2n).]
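As an illustration of the sign-extension rule, the sketch below multiplies the 5-bit pattern for -13 by +11 and keeps a 10-bit result. The function names are chosen for this example, and the code assumes a non-negative multiplier, as in the case discussed above.

```python
def sign_extend(pattern: int, from_bits: int, to_bits: int) -> int:
    """Extend a 2's-complement bit pattern from from_bits to to_bits."""
    sign = (pattern >> (from_bits - 1)) & 1
    if sign:
        # Fill every bit position above from_bits with 1s.
        pattern |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return pattern


def multiply_negative_multiplicand(m_pattern: int, multiplier: int, n: int) -> int:
    """Multiply an n-bit 2's-complement multiplicand by a non-negative multiplier."""
    width = 2 * n
    mask = (1 << width) - 1
    extended = sign_extend(m_pattern, n, width)
    product = 0
    for i in range(n):
        if (multiplier >> i) & 1:
            # Add the sign-extended multiplicand, shifted i places, modulo 2^(2n).
            product = (product + (extended << i)) & mask
    return product


result = multiply_negative_multiplicand(0b10011, 0b01011, 5)  # -13 x +11
print(f"{result:010b}")  # 1101110001, the 10-bit pattern for -143
```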

For a negative multiplier, a straightforward solution is to form the 2's-complement of
both the multiplier and the multiplicand and proceed as in the case of a positive multiplier.
This is possible because complementation of both operands does not change the value or the sign
of the product.
[If the sign bit is 0, the number is positive. If the sign bit is 1, the number is negative.]

The Booth Algorithm


Algorithm & Flowchart for Booth Multiplication
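1. The multiplicand is placed in BR and the multiplier in QR.
2. The accumulator register AC and the extra bit Qn+1 are initialized to 0. The sequence
counter SC is set to n, the number of bits in the multiplier.
3. Compare Qn and Qn+1 and perform the following:
10 -> AC = AC + BR' + 1 (subtract the multiplicand)
01 -> AC = AC + BR (add the multiplicand)
00 -> No arithmetic operation
11 -> No arithmetic operation
4. ASHR: arithmetic shift right of AC, QR and Qn+1.
5. Decrement SC by 1. If SC is not 0, repeat from step 3.
The final product is stored in AC and QR.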
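The steps listed above can be traced in software. The sketch below is one possible rendering, assuming both operands fit in n bits and that the multiplicand is not the most negative n-bit value (its negation would not fit in AC). The names br, ac, qr, q_extra and sc mirror BR, AC, QR, Qn+1 and SC; the function name is chosen for this example.

```python
def booth_multiply(multiplicand: int, multiplier: int, n: int) -> int:
    """Booth multiplication of two signed n-bit integers; returns a signed product."""
    mask = (1 << n) - 1
    br = multiplicand & mask          # BR
    neg_br = (-multiplicand) & mask   # BR' + 1
    ac = 0                            # AC
    qr = multiplier & mask            # QR
    q_extra = 0                       # Qn+1
    sc = n                            # sequence counter
    while sc > 0:
        qn = qr & 1
        if (qn, q_extra) == (1, 0):
            ac = (ac + neg_br) & mask  # subtract the multiplicand
        elif (qn, q_extra) == (0, 1):
            ac = (ac + br) & mask      # add the multiplicand
        # Arithmetic shift right of AC, QR, Qn+1 (the sign bit of AC is kept).
        q_extra = qr & 1
        qr = (qr >> 1) | ((ac & 1) << (n - 1))
        sign = ac >> (n - 1)
        ac = (ac >> 1) | (sign << (n - 1))
        sc -= 1
    product = (ac << n) | qr
    if product >> (2 * n - 1):         # interpret the 2n-bit pattern as signed
        product -= 1 << (2 * n)
    return product


print(booth_multiply(-9, -13, 5))  # 117
print(booth_multiply(13, -6, 5))   # -78
```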

Multiply -9 x -13 using Booth Algorithm

Multiplicand: 9 = 1001, +9 = 01001, -9 = 10110 + 1 = 10111 (BR)
Multiplier: 13 = 1101, +13 = 01101, -13 = 10010 + 1 = 10011 (Q)
BR' + 1 = 01000 + 1 = 01001

Qn Qn+1  Operation          AC     Q      Qn+1  SC
         Initial            00000  10011  0     101
1  0     SUB (AC + BR'+1)   01001  10011  0     101
         ASHR               00100  11001  1     100
1  1     ASHR               00010  01100  1     011
0  1     ADD (AC + BR)      11001  01100  1     011
         ASHR               11100  10110  0     010
0  0     ASHR               11110  01011  0     001
1  0     SUB (AC + BR'+1)   00111  01011  0     001
         ASHR               00011  10101  1     000

Resultant product in AC and Q = 00011 10101
= 2^6 + 2^5 + 2^4 + 2^2 + 2^0
= 117


Multiply 13 x -6 using Booth Algorithm

Multiplicand: 13 = 1101, +13 = 01101 (BR)
Multiplier: 6 = 0110, +6 = 00110, -6 = 11001 + 1 = 11010 (Q)
BR' + 1 = 10010 + 1 = 10011

Qn Qn+1  Operation          AC     Q      Qn+1  SC
         Initial            00000  11010  0     101
0  0     ASHR               00000  01101  0     100
1  0     SUB (AC + BR'+1)   10011  01101  0     100
         ASHR               11001  10110  1     011
0  1     ADD (AC + BR)      00110  10110  1     011
         ASHR               00011  01011  0     010
1  0     SUB (AC + BR'+1)   10110  01011  0     010
         ASHR               11011  00101  1     001
1  1     ASHR               11101  10010  1     000

[13 x -6 gives a negative product, so its magnitude is found by taking the 2's complement
of the result.]
Resultant product in AC and Q = 11101 10010
2's complement = 00010 01101 + 1 = 00010 01110
Magnitude = 2^6 + 2^3 + 2^2 + 2^1 = 78
Product = -78

Multiply -11 x 8 using Booth Algorithm

Multiplicand: 11 = 1011, +11 = 01011, -11 = 10100 + 1 = 10101 (BR)
Multiplier: 8 = 1000, +8 = 01000 (Q)
BR' + 1 = 01010 + 1 = 01011

Qn Qn+1  Operation          AC     Q      Qn+1  SC
         Initial            00000  01000  0     101
0  0     ASHR               00000  00100  0     100
0  0     ASHR               00000  00010  0     011
0  0     ASHR               00000  00001  0     010
1  0     SUB (AC + BR'+1)   01011  00001  0     010
         ASHR               00101  10000  1     001
0  1     ADD (AC + BR)      11010  10000  1     001
         ASHR               11101  01000  0     000

[-11 x 8 gives a negative product, so its magnitude is found by taking the 2's complement
of the result.]
Resultant product in AC and Q = 11101 01000
2's complement = 00010 10111 + 1 = 00010 11000
Magnitude = 2^6 + 2^4 + 2^3 = 88
Product = -88


Multiply the following pair of signed 2's complement numbers using Booth's
algorithm. Assume A is the multiplicand and B is the multiplier.
A=010111 B=110110

Answer:
A = 010111: the sign bit is 0, so A is positive. 10111 = 23, therefore A = +23.
B = 110110: the sign bit is 1, so B is negative. The 2's complement of 10110 is
01001 + 1 = 01010 = 10, therefore B = -10.

Multiply +23 x -10

Multiplicand: +23 = 010111 (BR)
Multiplier: -10 = 110110 (Q)
BR' + 1 = 101000 + 1 = 101001

Qn Qn+1  Operation          AC      Q       Qn+1  SC
         Initial            000000  110110  0     0110
0  0     ASHR               000000  011011  0     0101
1  0     SUB (AC + BR'+1)   101001  011011  0     0101
         ASHR               110100  101101  1     0100
1  1     ASHR               111010  010110  1     0011
0  1     ADD (AC + BR)      010001  010110  1     0011
         ASHR               001000  101011  0     0010
1  0     SUB (AC + BR'+1)   110001  101011  0     0010
         ASHR               111000  110101  1     0001
1  1     ASHR               111100  011010  1     0000


[+23 x -10 gives a negative product, so its magnitude is found by taking the 2's complement
of the result.]

Resultant product in AC and Q = 111100 011010
2's complement = 000011 100101 + 1 = 000011 100110
Magnitude = 2^7 + 2^6 + 2^5 + 2^2 + 2^1 = 230
Product = -230

Features of Booth Algorithm:

• The Booth algorithm works equally well for both negative and positive multipliers.
• The Booth algorithm handles signed (2's complement) multiplication directly.
• It can speed up multiplication when the multiplier contains long blocks of 1s.

Booth Recoding of a Multiplier:

In general, in the Booth algorithm, -1 times the shifted multiplicand is selected when moving
from 0 to 1, and +1 times the shifted multiplicand is selected when moving from 1 to 0, as the
multiplier is scanned from right to left. The case in which the LSB of the multiplier is 1 is
handled by assuming that an implied 0 lies to its right.

• In a worst-case multiplier, the number of addition and subtraction operations is large.
• In an ordinary multiplier, some recoded digits are 0 (no operation), but several addition
and subtraction operations must still be performed.
• In a good multiplier, the Booth algorithm works well because the majority of the recoded
digits are 0s.
• A good multiplier consists of a few blocks (sequences) of 1s.


The Booth algorithm reduces the number of additions required when the multiplier has
a few large blocks of 1s. The speed gained by skipping over 1s depends on the data. On average,
the speed of multiplication with the Booth algorithm is the same as with normal
multiplication.
• Best case: a long string of 1s (skipping over 1s)
• Worst case: 0s and 1s alternating
• The transformation 011...110 to 100...0-10 is called skipping over 1s.
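The recoding rule can be illustrated with a small sketch. It scans the multiplier from right to left with an implied 0 to the right of the LSB and emits a digit of +1, 0 or -1 per bit. The function name booth_recode and the string input format are assumptions made for this example.

```python
def booth_recode(bits: str) -> list[int]:
    """Booth-recode a binary multiplier given as a string, MSB first."""
    digits = []
    right = 0  # implied 0 to the right of the LSB
    for ch in reversed(bits):  # scan from right to left
        bit = int(ch)
        # 0 -> 1 going left gives -1, 1 -> 0 gives +1, no change gives 0.
        digits.append(right - bit)
        right = bit
    digits.reverse()  # return the digits MSB first
    return digits


print(booth_recode("01110"))   # [1, 0, 0, -1, 0]: one block of 1s, two operations
print(booth_recode("010101"))  # [1, -1, 1, -1, 1, -1]: alternating bits, worst case
```

The number of non-zero digits is the number of additions and subtractions that the multiplier needs.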

INTEGER DIVISION

The figure shows examples of decimal division and binary division of the same
values. Consider the decimal version first. The 2 in the quotient is determined by the
following reasoning: First, we try to divide 13 into 2, and it does not work. Next, we try to divide
13 into 27. We go through the trial exercise of multiplying 13 by 2 to get 26, and, observing
that 27 - 26 = 1 is less than 13, we enter 2 as the quotient and perform the required subtraction.


The next digit of the dividend, 4, is brought down, and we finish by deciding that 13
goes into 14 once and the remainder is 1. We can discuss binary division in a similar way, with the
simplification that the only possibilities for the quotient bits are 0 and 1.
A circuit that implements division by this longhand method operates as follows:
It positions the divisor appropriately with respect to the dividend and performs a subtraction. If the
remainder is zero or positive, a quotient bit of 1 is determined, the remainder is extended by
another bit of the dividend, the divisor is repositioned, and another subtraction is performed.

Figure 1

If the remainder is negative, a quotient bit of 0 is determined, the dividend is restored by
adding back the divisor, and the divisor is repositioned for another subtraction. This is called the
restoring division algorithm.

Restoring Division
The figure shows a logic circuit arrangement that implements the restoring division algorithm
just discussed. An n-bit positive divisor is loaded into register M and an n-bit positive dividend is
loaded into register Q at the start of the operation. Register A is set to 0. After the division is
complete, the n-bit quotient is in register Q and the remainder is in register A.


The required subtractions are facilitated by using 2's-complement arithmetic. The extra
bit position at the left end of both A and M accommodates the sign bit during subtractions.
The following algorithm performs restoring division.
Do the following three steps n times:
1. Shift A and Q left one bit position.
2. Subtract M from A, and place the answer back in A.
3. If the sign of A is 1, set q0 to 0 and add M back to A (that is, restore A); otherwise,
set q0 to 1.
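The three steps can be sketched in Python as follows. This is an illustration for positive operands only; a, q and m mirror registers A, Q and M, and a plain Python integer stands in for the extra sign bit of A. The function name is chosen for this example.

```python
def restoring_divide(dividend: int, divisor: int, n: int) -> tuple[int, int]:
    """Restoring division of an n-bit positive dividend; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    mask = (1 << n) - 1
    a = 0                # register A
    q = dividend & mask  # register Q
    m = divisor          # register M
    for _ in range(n):
        # Step 1: shift A and Q left one bit; the MSB of Q moves into A.
        a = (a << 1) | ((q >> (n - 1)) & 1)
        q = (q << 1) & mask
        # Step 2: subtract M from A.
        a -= m
        # Step 3: a negative A means q0 = 0 and A is restored; otherwise q0 = 1.
        if a < 0:
            a += m
        else:
            q |= 1
    return q, a


print(restoring_divide(8, 3, 4))     # (2, 2)
print(restoring_divide(274, 13, 9))  # (21, 1)
```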

Figure 2

PIPELINING

Pipelining is a technique of decomposing a sequential process into suboperations, with each
subprocess being executed in a special dedicated segment that operates concurrently with
all other segments.


A pipeline can be visualized as a collection of processing segments through
which binary information flows. Each segment performs partial processing dictated by the
way the task is partitioned.
The result obtained from the computation in each segment is transferred to
the next segment in the pipeline. The final result is obtained after the data have passed
through all segments.

Pipeline Organization
The simplest way of viewing the pipeline structure is to imagine that each segment
consists of an input register followed by a combinational circuit. The register holds the data
and the combinational circuit performs the sub operation in the particular segment. The
output of the combinational circuit is applied to the input register of the next segment. A
clock is applied to all registers after enough time has elapsed to perform all segment
activity. In this way the information flows through the pipeline one step at a time.
Example demonstrating the pipeline organization


Suppose we want to perform the combined multiply and add operations with a stream of
numbers:
Ai*Bi + Ci for i = 1, 2, 3, ..., 7
Each suboperation is to be implemented in a segment within a pipeline. Each segment has one
or two registers and a combinational circuit, as shown in the figure.

R1 through R5 are registers that receive new data with every clock pulse.
The multiplier and adder are combinational circuits. The suboperations performed
in each segment of the pipeline are as follows:
R1 <- Ai, R2 <- Bi          Input Ai and Bi
R3 <- R1*R2, R4 <- Ci       Multiply and input Ci
R5 <- R3+R4                 Add Ci to product
The five registers are loaded with new data every clock pulse.


The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse
transfers the product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers
A2 and B2 into R1 and R2. The third clock pulse operates on all three segments
simultaneously. It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into
R3, transfers C2 into R4, and places the sum of R3 and R4 into R5. It takes three clock
pulses to fill up the pipe and retrieve the first output from R5. From there on, each clock
produces a new output and moves the data one step down the pipeline. This happens as long
as new input data flow into the system.
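The clock-by-clock behaviour described above can be simulated with a short sketch. The variables r1 to r5 mirror registers R1 to R5; the function name and the use of None for an empty register are assumptions made for this example.

```python
def run_multiply_add_pipeline(a: list[int], b: list[int], c: list[int]) -> list[tuple[int, int]]:
    """Simulate the three-segment pipeline; returns (clock pulse, value in R5) pairs."""
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    outputs = []
    for clock in range(1, n + 3):
        # All registers load on the same clock pulse, so every new value is
        # computed from the old register contents before any register changes.
        new_r5 = r3 + r4 if r3 is not None else None
        new_r3 = r1 * r2 if r1 is not None else None
        new_r4 = c[clock - 2] if 2 <= clock <= n + 1 else None
        new_r1 = a[clock - 1] if clock <= n else None
        new_r2 = b[clock - 1] if clock <= n else None
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        if r5 is not None:
            outputs.append((clock, r5))
    return outputs


print(run_multiply_add_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]))
# [(3, 11), (4, 18), (5, 27)]
```

The first result appears after the third clock pulse, and one new result follows on every pulse after that.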

FOUR SEGMENT Pipeline


The general structure of a four-segment pipeline is shown in the figure. The operands are passed
through all four segments in a fixed sequence. Each segment consists of a combinational
circuit Si that performs a suboperation over the data stream flowing through the pipe. The
segments are separated by registers Ri that hold the intermediate results between the stages.
Information flows between adjacent stages under the control of a common clock applied to
all the registers simultaneously.


SPACE TIME DIAGRAM:


The behavior of a pipeline can be illustrated with a space-time diagram. This is a diagram
that shows the segment utilization as a function of time.
In the figure, the horizontal axis displays the time in clock cycles and the vertical axis gives the
segment number. The diagram shows six tasks T1 through T6 executed in four segments.
Initially, task T1 is handled by segment 1. After the first clock, segment 2 is busy with T1,
while segment 1 is busy with task T2. Continuing in this manner, the first task T1 is completed
after the fourth clock cycle. From then on, the pipe completes a task every clock cycle.
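A space-time diagram of this kind can be generated with a few lines of Python. The sketch below assumes an ideal pipeline with no stalls, so n tasks occupy k + n - 1 clock cycles; the function name and the "--" marker for an idle segment are choices made for this example.

```python
def space_time_diagram(k: int, n: int) -> list[list[str]]:
    """Return one row per segment; each row lists the task handled in each clock cycle."""
    total_cycles = k + n - 1
    rows = []
    for segment in range(1, k + 1):
        row = []
        for clock in range(1, total_cycles + 1):
            task = clock - segment + 1  # task number in this segment at this clock
            row.append(f"T{task}" if 1 <= task <= n else "--")
        rows.append(row)
    return rows


for segment, row in enumerate(space_time_diagram(4, 6), start=1):
    print(f"Segment {segment}: " + " ".join(row))
```

Each task moves down one segment per clock cycle, so the diagram is a diagonal band of task labels.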
Consider the case where a k-segment pipeline with a clock cycle time tp is used to
execute n tasks. The first task T1 requires a time equal to k*tp to complete its operation, since
there are k segments in the pipe. The remaining n-1 tasks emerge from the pipe at the rate of
one task per clock cycle, and they will be completed after a time equal to (n-1)*tp. Therefore,
completing n tasks using a k-segment pipeline requires k + (n-1) clock cycles.

Consider a nonpipeline unit that performs the same operation and takes a time equal to
tn to complete each task. The total time required for n tasks is n*tn. The speedup of
pipeline processing over equivalent nonpipeline processing is defined by the ratio

S = (n * tn) / ((k + n - 1) * tp)

As the number of tasks increases, n becomes much larger than k-1, and k+n-1 approaches the
value of n. Under this condition the speedup ratio becomes

S = tn / tp

If we assume that the time it takes to process a task is the same in the pipeline and
nonpipeline circuits, we will have tn = k*tp. With this assumption the speedup ratio reduces to

S = (k * tp) / tp = k

CLASSIFICATION OF PIPELINE PROCESSORS

ARITHMETIC PIPELINES

An arithmetic pipeline divides an arithmetic operation into suboperations for
execution in the pipeline segments. Pipeline arithmetic units are usually found in very high
speed computers. They are used to implement floating point operations, multiplication of
fixed point numbers, and similar computations encountered in scientific problems.
Pipeline Unit for Floating Point Addition and Subtraction:
The inputs to the floating point adder pipeline are two normalized floating point
binary numbers.
X = A * 2^a
Y = B * 2^b


A and B are two fractions that represent the mantissas, and a and b are the exponents. The
floating point addition and subtraction can be performed in four segments. The registers
labeled R are placed between the segments to store intermediate results. The suboperations
that are performed in the four segments are:

1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
The exponents are compared by subtracting them to determine their difference. The
larger exponent is chosen as the exponent of the result. The exponent difference determines
how many times the mantissa associated with the smaller exponent must be shifted to the
right. This produces an alignment of the two mantissas.
The two mantissas are added or subtracted in segment 3. The result is normalized in
segment 4. When an overflow occurs, the mantissa of the sum or difference is shifted right
and the exponent is incremented by one. If an underflow occurs, the number of leading zeros
in the mantissa determines the number of left shifts in the mantissa and the number that must
be subtracted from the exponent.
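The four suboperations can be followed in a small sketch. It represents each number as a (mantissa, exponent) pair with a fractional mantissa, and uses Python floats in place of fixed-width mantissa registers, so it illustrates the order of the steps rather than the hardware. The function name fp_add is chosen for this example.

```python
def fp_add(x: tuple[float, int], y: tuple[float, int]) -> tuple[float, int]:
    """Add two numbers given as (mantissa, exponent), meaning mantissa * 2**exponent."""
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare the exponents and keep the larger one.
    diff = ea - eb
    exponent = max(ea, eb)
    # Segment 2: align by shifting right the mantissa with the smaller exponent.
    if diff > 0:
        mb = mb / (2 ** diff)
    elif diff < 0:
        ma = ma / (2 ** -diff)
    # Segment 3: add the mantissas (a negative mantissa gives subtraction).
    mantissa = ma + mb
    # Segment 4: normalize so that 0.5 <= |mantissa| < 1.
    if mantissa == 0:
        return 0.0, 0
    while abs(mantissa) >= 1.0:  # overflow: shift right, increment the exponent
        mantissa /= 2
        exponent += 1
    while abs(mantissa) < 0.5:   # underflow: shift left, decrement the exponent
        mantissa *= 2
        exponent -= 1
    return mantissa, exponent


print(fp_add((0.75, 3), (0.5, 1)))   # (0.875, 3), that is 6 + 1 = 7
print(fp_add((0.75, 3), (0.75, 3)))  # (0.75, 4), that is 6 + 6 = 12
```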

INSTRUCTION PIPELINE
An instruction pipeline operates on a stream of instructions by overlapping the
fetch, decode, and execute phases of the instruction cycle. An instruction pipeline reads
consecutive instructions from memory while previous instructions are being executed in
other segments. This causes the instruction fetch and execute phases to overlap and perform
simultaneous operations.
Consider a computer with an instruction fetch unit and an instruction execute unit
designed to provide a two-segment pipeline. The instruction fetch segment can be
implemented by means of a first-in, first-out (FIFO) buffer. Whenever the execution unit is
not using memory, the control increments the program counter and uses its address value to
read consecutive instructions from memory. The instructions are inserted into the FIFO buffer
so that they can be executed on a first-in, first-out basis. Thus an instruction stream can be
placed in a queue, waiting for decoding and processing by the execution segment.

In general, the computer needs to process each instruction with the following sequence of
steps:
1. Fetch the instruction.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.

Four Segment Instruction Pipeline


The figure shows how the instruction cycle in the CPU can be processed with a four-segment
pipeline. While an instruction is being executed in segment 4, the next instruction in sequence
is busy fetching an operand from memory in segment 3. The effective address may be calculated
in a separate arithmetic circuit for the third instruction, and whenever the memory is
available, the fourth and all subsequent instructions can be fetched and placed in an
instruction FIFO.

The figure shows the operation of the instruction pipeline. The time on the horizontal axis is
divided into steps of equal duration. The four segments are represented in the diagram with
abbreviated symbols.


1. FI is the segment that fetches an instruction.
2. DA is the segment that decodes the instruction and calculates the effective address.
3. FO is the segment that fetches the operand.
4. EX is the segment that executes the instruction.

It is assumed that the processor has separate instruction and data memories so that the
operations in FI and FO can proceed at the same time. In the absence of a branch instruction,
each segment operates on a different instruction. Thus, in step 4, instruction 1 is being
executed in segment EX; the operand for instruction 2 is being fetched in segment FO;
instruction 3 is being decoded in segment DA; and instruction 4 is being fetched from
memory in segment FI.
Assume now that instruction 3 is a branch instruction. As soon as this instruction is decoded in
segment DA in step 4, the transfer from FI to DA of the other instructions is halted until the
branch instruction is executed in step 6.

PIPELINE CONFLICTS:
1. RESOURCE CONFLICTS: These are caused by access to memory by two segments
at the same time. Most of these conflicts can be resolved by using separate instruction
and data memories.

2. DATA DEPENDENCY: These conflicts arise when an instruction depends on the
result of a previous instruction, but this result is not yet available.

3. BRANCH DIFFICULTIES: These arise from branch and other instructions that change
the value of PC.


PIPELINE HAZARD DETECTION AND RESOLUTION

DATA HAZARDS
• We must ensure that the results obtained when instructions are executed in a pipelined
processor are identical to those obtained when the same instructions are executed
sequentially.
• A hazard occurs here, because the second operation uses the result of the first:
A ← 3 + A
B ← 4 × A
• No hazard occurs here, because the two operations are independent:
A ← 5 × C
B ← 20 + C
• When two operations depend on each other, they must be executed sequentially in the
correct order.
• Another example, in which Add uses R4, the result of Mul, as a source operand:
Mul R2, R3, R4
Add R5, R4, R6
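Detecting this kind of dependency amounts to comparing the destination of one instruction with the sources of the next. The sketch below assumes the three-operand format used in the example, with the destination written last; the tuple representation and the function name are assumptions made for illustration.

```python
Instruction = tuple[str, ...]  # (opcode, source1, source2, destination)


def has_data_hazard(first: Instruction, second: Instruction) -> bool:
    """True if the second instruction reads the register written by the first."""
    destination = first[-1]
    sources = second[1:-1]
    return destination in sources


mul = ("Mul", "R2", "R3", "R4")
add = ("Add", "R5", "R4", "R6")
print(has_data_hazard(mul, add))  # True: Add reads R4, which Mul writes
```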

Operand Forwarding
• Instead of reading it from the register file, the second instruction can get the data directly
from the output of the ALU after the previous instruction is completed.
• A special arrangement needs to be made to "forward" the output of the ALU to the input of
the ALU.

[Figure 8.7. Operand forwarding in a pipelined processor: (a) Datapath, showing the register
file, the SRC1 and SRC2 source registers, the ALU, the RSLT result register and the forwarding
path; (b) Position of the source and result registers in the processor pipeline, with stages
E: Execute (ALU) and W: Write (register file).]

Handling Data Hazards in Software

• Let the compiler detect and handle the hazard:
I1: Mul R2, R3, R4
    NOP
    NOP
I2: Add R5, R4, R6
• The compiler can reorder the instructions to perform some useful work during the NOP
slots.
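The compiler-side fix can be sketched as a pass that inserts NOPs wherever an instruction reads a register written too recently. The two-slot delay matches the example above but is otherwise an assumption, as are the tuple format (destination last) and the function name.

```python
Instruction = tuple[str, ...]  # (opcode, source1, source2, destination) or ("NOP",)


def insert_nops(program: list[Instruction], delay: int = 2) -> list[Instruction]:
    """Insert NOPs so that a result is not read until `delay` slots after its producer."""
    result: list[Instruction] = []
    for instruction in program:
        sources = instruction[1:-1]
        needed = 0
        # Look back over the last `delay` emitted instructions for a producer.
        for distance, earlier in enumerate(reversed(result[-delay:]), start=1):
            if earlier[0] != "NOP" and earlier[-1] in sources:
                needed = max(needed, delay - distance + 1)
        result.extend([("NOP",)] * needed)
        result.append(instruction)
    return result


program = [("Mul", "R2", "R3", "R4"), ("Add", "R5", "R4", "R6")]
for instruction in insert_nops(program):
    print(" ".join(instruction))
```

A real compiler would first try to move independent instructions into these slots and fall back to NOPs only when none is available.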

SIDE EFFECTS
• The dependency in the previous example is explicit and easily detected.
• Sometimes an instruction changes the contents of a register other than the one named as
the destination.
• When a location other than one explicitly named in an instruction as a destination
operand is affected, the instruction is said to have a side effect.
• Example: condition code flags. The second instruction below depends on the carry flag
set by the first:
Add R1, R3
AddWithCarry R2, R4
• Instructions designed for execution on pipelined hardware should have few side effects.

INSTRUCTION HAZARDS
• Whenever the stream of instructions supplied by the instruction fetch unit is interrupted,
the pipeline stalls. Two common causes are:
  - Cache miss
  - Branch
[Figure 8.8. An idle cycle caused by a branch instruction. Rows: I1, I2 (Branch), I3, Ik, Ik+1;
columns: clock cycles 1 to 6. I3 is fetched (F3) and then discarded (X), leaving the execution
unit idle for one cycle.]

Unconditional Branches

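Branch Timing
• Branch penalty: the time lost as a result of a branch instruction.
• Reducing the penalty: compute the branch address earlier in the pipeline, in the Decode
stage rather than the Execute stage.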
[Figure 8.9. Branch timing: (a) Branch address computed in Execute stage (clock cycles 1 to 8;
I3 and I4 are fetched and then discarded); (b) Branch address computed in Decode stage (clock
cycles 1 to 7; only I3 is fetched and then discarded).]

Instruction Queue and Prefetching



Conditional Branches
• A conditional branch instruction introduces the added hazard caused by the dependency
of the branch condition on the result of a preceding instruction.
• The decision to branch cannot be made until the execution of that instruction has been
completed.
• Branch instructions represent about 20% of the dynamic instruction count of most
programs.

Delayed Branch
• The instructions in the delay slots (the locations immediately following a branch
instruction) are always fetched. Therefore, we would like to arrange for them to be fully
executed whether or not the branch is taken.
• The objective is to place useful instructions in these slots.
• The effectiveness of the delayed branch approach depends on how often it is possible to
reorder instructions.

Branch Prediction
• The aim is to predict whether or not a particular branch will be taken.
• Simplest form: assume the branch will not take place and continue to fetch instructions in
sequential address order.
• Until the branch is evaluated, instruction execution along the predicted path must be done
on a speculative basis.
• Speculative execution: instructions are executed before the processor is certain that they
are in the correct execution sequence.
• Care is needed so that no processor registers or memory locations are updated until it
is confirmed that these instructions should indeed be executed.

Incorrectly Predicted Branch



Branch Prediction
• Better performance can be achieved if we arrange for some branch instructions to be
predicted as taken and others as not taken.
• Hardware can observe whether the target address is lower or higher than that of the
branch instruction.
• Alternatively, the compiler can include a branch prediction bit in the instruction.
• With these schemes, the prediction is the same every time a given instruction is
executed. This is called static branch prediction.
