COA Mod 3
Array Multiplier
Binary multiplication can be implemented in a combinational two-dimensional logic array
called an array multiplier.
The main component in each cell is a full adder, FA.
The AND gate in each cell determines whether a multiplicand bit, mj, is added to the
incoming partial product bit, based on the value of the multiplier bit, qi.
Each row i, where 0 <= i <= 3, adds the multiplicand (appropriately shifted) to the
incoming partial product, PPi, to generate the outgoing partial product, PP(i+1), if
qi = 1.
KTU - CST202 - Computer Organization and Architecture Module: 3
If qi = 0, PPi is passed vertically downward unchanged. PP0 is all 0s and PP4 is the
desired product. The multiplicand is shifted left one position per row by the diagonal
signal path.
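The row-by-row behaviour described above can be checked with a small bit-level model. This is only an illustrative sketch: the names `full_adder` and `array_multiply` are hypothetical, and each row is modelled as a ripple of full adders gated by an AND with qi, not as the exact cell wiring of the figure.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def array_multiply(m, q, n=4):
    """Multiply two unsigned n-bit integers one row at a time.

    Row i ANDs every multiplicand bit mj with qi and adds the result,
    shifted left by i positions, to the incoming partial product PPi.
    """
    pp = [0] * (2 * n)                  # PP0 is all 0s; index 0 is the LSB
    for i in range(n):                  # one row per multiplier bit
        qi = (q >> i) & 1
        carry = 0
        for j in range(n):              # one cell per multiplicand bit
            mj = (m >> j) & 1
            pp[i + j], carry = full_adder(pp[i + j], mj & qi, carry)
        pp[i + n] = carry               # carry out of the row
    return sum(bit << k for k, bit in enumerate(pp))
```

With n = 4, `array_multiply(13, 11)` walks through the four rows PP1 to PP4 and returns the 8-bit product.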
Disadvantages:
(1) An n-bit by n-bit array multiplier requires n^2 AND gates, n(n-2) full adders, and n
half adders. (Half adders are used where there are 2 inputs and full adders where there
are 3 inputs.)
(2) The longest path from input to output passes through n adders in the top row, n-1
adders in the bottom row, and n-3 adders in the middle rows. The longest path in a
circuit is called the critical path.
Sequential Circuit Multiplier
Multiplication is performed as a series of n conditional addition and shift operations: if the
given bit of the multiplier is 0, only a shift operation is performed, while if the
given bit of the multiplier is 1, the multiplicand is added to the partial product and a shift
operation is then performed.
The combinational array multiplier uses a large number of logic gates for multiplying
numbers. Multiplication of two n-bit numbers can also be performed in a sequential circuit that
uses a single n-bit adder.
The block diagram in the figure shows the hardware arrangement for sequential
multiplication. This circuit performs multiplication by using a single n-bit adder n times to
implement the spatial addition performed by the n rows of ripple-carry adders in the array
multiplier. Registers A and Q are shift registers, concatenated as shown. Together, they hold
partial product PPi while multiplier bit qi generates the signal Add/Noadd. This signal causes
the multiplexer MUX to select 0 when qi = 0, or to select the multiplicand M when qi = 1, to be
added to PPi to generate PP(i+1). The product is computed in n cycles. The partial product grows
in length by one bit per cycle from the initial vector, PP0, of n 0s in register A. The carry-out
from the adder is stored in flip-flop C, shown at the left end of register A.
Algorithm:
(1) The multiplier and multiplicand are loaded into registers Q and M. Register A and
flip-flop C are cleared to 0.
(2) In each cycle, one of two actions is taken, depending on the LSB of Q:
(a) If the LSB of Q (the current multiplier bit qi) is 1, the control sequencer generates
the Add signal, which adds the multiplicand M to register A and stores the result in A.
(b) If qi = 0, it generates the Noadd signal, which leaves the previous value in register A.
(3) Registers C, A and Q are then shifted right by 1 bit. Steps (2) and (3) are repeated n times.
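The algorithm above can be sketched in software as follows. This is a minimal illustration for unsigned operands, not a description of the actual control sequencer; the function name `sequential_multiply` and its parameters are assumptions made for the example.

```python
def sequential_multiply(m, q, n):
    """Shift-and-add multiplication of unsigned n-bit m (register M) and q (register Q)."""
    mask = (1 << n) - 1
    a = 0                                    # register A cleared
    for _ in range(n):                       # n cycles
        c = 0                                # flip-flop C, Noadd case
        if q & 1:                            # LSB of Q is 1: Add
            total = a + m
            c = total >> n                   # adder carry-out goes to C
            a = total & mask
        # Shift C, A, Q right by one bit as a single concatenated register.
        q = (q >> 1) | ((a & 1) << (n - 1))  # LSB of A moves into MSB of Q
        a = (a >> 1) | (c << (n - 1))        # C moves into MSB of A
    return (a << n) | q                      # product: high half in A, low half in Q
```

For example, `sequential_multiply(13, 11, 4)` leaves the 8-bit product 143 in A and Q.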
[To extend the sign bit: since the operands are 5-bit signed numbers, a 10-bit product must be
generated. So, if the partial product's MSB is 1, insert 1 on the left for sign extension;
if the partial product's MSB is 0, insert 0 on the left.]
BR = 10111, BR'+1 = 01001

Qn Qn+1  Operation       AC     Q      Qn+1  SC
         Initial         00000  10011  0     101
1  0     SUB (+01001)    01001  10011  0     101
         ASHR            00100  11001  1     100
1  1     ASHR            00010  01100  1     011
0  1     ADD (+10111)    11001  01100  1     011
         ASHR            11100  10110  0     010
0  0     ASHR            11110  01011  0     001
1  0     SUB (+01001)    00111  01011  0     001
         ASHR            00011  10101  1     000
BR = 01101, BR'+1 = 10011

Qn Qn+1  Operation       AC     Q      Qn+1  SC
         Initial         00000  11010  0     101
0  0     ASHR            00000  01101  0     100
1  0     SUB (+10011)    10011  01101  0     100
         ASHR            11001  10110  1     011
0  1     ADD (+01101)    00110  10110  1     011
         ASHR            00011  01011  0     010
1  0     SUB (+10011)    10110  01011  0     010
         ASHR            11011  00101  1     001
1  1     ASHR            11101  10010  1     000
[13 x -6 gives a negative product, so the 2's complement of the resultant product should be
determined.]
Resultant product in AC and Q = 11101 10010
2's complement = 00010 01101 + 1 = 00010 01110
               = 2^6 + 2^3 + 2^2 + 2^1 = 78
Product = -78
BR = 10101, BR'+1 = 01011

Qn Qn+1  Operation       AC     Q      Qn+1  SC
         Initial         00000  01000  0     101
0  0     ASHR            00000  00100  0     100
0  0     ASHR            00000  00010  0     011
0  0     ASHR            00000  00001  0     010
1  0     SUB (+01011)    01011  00001  0     010
         ASHR            00101  10000  1     001
0  1     ADD (+10101)    11010  10000  1     001
         ASHR            11101  01000  0     000
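[-11 x 8 gives a negative product, so the 2's complement of the resultant product should be
determined. Here AC and Q hold 11101 01000; the 2's complement is 00010 11000 =
2^6 + 2^4 + 2^3 = 88, so the product is -88.]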
Multiply each of the following pairs of signed 2’s complement number using Booth’s
algorithm. In each of the cases assume A is the multiplicand and B is the multiplier.
A=010111 B=110110
Answer:
A = 010111: the sign bit is 0, so A is a positive number. A = +23 [010111].
B = 110110: the sign bit is 1, so B is a negative number. Find its 2's complement:
the 2's complement of 110110 is 001001 + 1 = 001010 => 10.
Therefore, B = -10 [110110].
BR = 010111, BR'+1 = 101001

Qn Qn+1  Operation        AC      Q       Qn+1  SC
         Initial          000000  110110  0     0110
0  0     ASHR             000000  011011  0     0101
1  0     SUB (+101001)    101001  011011  0     0101
         ASHR             110100  101101  1     0100
1  1     ASHR             111010  010110  1     0011
0  1     ADD (+010111)    010001  010110  1     0011
         ASHR             001000  101011  0     0010
1  0     SUB (+101001)    110001  101011  0     0010
         ASHR             111000  110101  1     0001
1  1     ASHR             111100  011010  1     0000
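[+23 x -10 gives a negative product, so the 2's complement of the resultant product should be
determined. Here AC and Q hold 111100 011010; the 2's complement is 000011 100110 =
2^7 + 2^6 + 2^5 + 2^2 + 2^1 = 230, so the product is -230.]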
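The register traces in the worked examples can be reproduced with a short simulation. The sketch below is illustrative: `booth_multiply` and `to_signed` are hypothetical names, operands are passed as n-bit two's-complement bit patterns, and the variable `q_extra` stands for the appended bit Qn+1. Like the n-bit AC register in the tables, it does not handle the most negative multiplicand (for example 10000 with n = 5).

```python
def booth_multiply(br, qr, n):
    """Booth multiplication of two n-bit two's-complement bit patterns.

    br is the multiplicand pattern, qr the multiplier pattern.
    Returns (ac, q); AC followed by Q is the 2n-bit product.
    """
    mask = (1 << n) - 1
    sign = 1 << (n - 1)
    neg_br = (-br) & mask                 # BR' + 1
    ac, q, q_extra = 0, qr & mask, 0
    for _ in range(n):                    # SC counts down from n to 0
        pair = (q & 1, q_extra)           # (Qn, Qn+1)
        if pair == (1, 0):
            ac = (ac + neg_br) & mask     # SUB: AC <- AC + BR' + 1
        elif pair == (0, 1):
            ac = (ac + br) & mask         # ADD: AC <- AC + BR
        # ASHR: arithmetic shift right of AC, Q and Qn+1 together.
        q_extra = q & 1
        q = (q >> 1) | ((ac & 1) << (n - 1))
        ac = (ac >> 1) | (ac & sign)      # sign bit of AC is replicated
    return ac, q


def to_signed(value, bits):
    """Interpret a bit pattern of the given width as a two's-complement integer."""
    return value - (1 << bits) if value >> (bits - 1) else value
```

For instance, `booth_multiply(0b01101, 0b11010, 5)` returns AC = 11101 and Q = 10010, and `to_signed` applied to the combined 10-bit pattern gives -78.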
The Booth algorithm works equally well for both negative and positive multipliers.
The Booth algorithm handles signed multiplication of the given numbers.
It can speed up the multiplication process.
In general, in the Booth algorithm, -1 times the shifted multiplicand is selected when moving
from 0 to 1, and +1 times the shifted multiplicand is selected when moving from 1 to 0, as the
multiplier is scanned from right to left. The case when the LSB of the multiplier is 1 is
handled by assuming that an implied 0 lies to its right.
For a worst-case multiplier, the number of addition and subtraction operations is large.
For an ordinary multiplier, a 0 indicates no operation, but there are still addition and
subtraction operations to be performed.
For a good multiplier, the Booth algorithm works well because most digits of the recoded
multiplier are 0s.
A good multiplier consists of blocks (sequences) of 1s.
The Booth algorithm achieves efficiency in the number of additions required when the multiplier
has a few large blocks of 1s. The speed gained by skipping over 1s depends on the data. On
average, the speed of multiplication with the Booth algorithm is the same as with normal
multiplication.
• Best case – a long string of 1s (skipping over 1s)
• Worst case – 0s and 1s alternate
• The transformation 011...110 to +1 0 0 ... 0 -1 0 is called skipping over 1s.
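The recoding behind "skipping over 1s" can be illustrated with a few lines of code. This is a sketch under the assumption that the multiplier is given as a bit string with the MSB first; `booth_recode` and `operation_count` are hypothetical helper names.

```python
def booth_recode(bits):
    """Recode a multiplier bit string (MSB first) into Booth digits -1, 0, +1.

    Each digit is (bit to the right) - (bit itself), with an implied 0
    to the right of the LSB.
    """
    padded = bits + "0"
    return [int(padded[i + 1]) - int(padded[i]) for i in range(len(bits))]


def operation_count(bits):
    """Number of additions/subtractions needed for the recoded multiplier."""
    return sum(1 for digit in booth_recode(bits) if digit != 0)
```

A single block of 1s such as `"0111110"` recodes to `[1, 0, 0, 0, 0, -1, 0]`, which needs two operations, while the alternating pattern `"010101"` recodes to `[1, -1, 1, -1, 1, -1]`, which needs six.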
INTEGER DIVISION
The figure shows examples of decimal division and binary division of the same
values. Consider the decimal version first. The 2 in the quotient is determined by the
following reasoning: First, we try to divide 13 into 2, and it does not work. Next, we try to
divide 13 into 27. We go through the trial exercise of multiplying 13 by 2 to get 26, and,
observing that 27 − 26 = 1 is less than 13, we enter 2 as the quotient and perform the required
subtraction. The next digit of the dividend, 4, is brought down, and we finish by deciding that
13 goes into 14 once and the remainder is 1. We can discuss binary division in a similar way,
with the simplification that the only possibilities for the quotient bits are 0 and 1.
A circuit that implements division by this longhand method operates as follows:
It positions the divisor appropriately with respect to the dividend and performs a subtraction.
If the remainder is zero or positive, a quotient bit of 1 is determined, the remainder is
extended by another bit of the dividend, the divisor is repositioned, and another subtraction
is performed. If the remainder is negative, a quotient bit of 0 is determined, the dividend is
restored by adding back the divisor, and the divisor is repositioned for another subtraction.
Figure 1
Restoring Division
The figure shows a logic circuit arrangement that implements the restoring division algorithm
just discussed. An n-bit positive divisor is loaded into register M and an n-bit positive
dividend is loaded into register Q at the start of the operation. Register A is set to 0. After
the division is complete, the n-bit quotient is in register Q and the remainder is in register A.
The required subtractions are facilitated by using 2's-complement arithmetic. The extra
bit position at the left end of both A and M accommodates the sign bit during subtractions.
The following algorithm performs restoring division.
Do the following three steps n times:
1. Shift A and Q left one bit position.
2. Subtract M from A, and place the answer back in A.
3. If the sign of A is 1, set q0 to 0 and add M back to A (that is, restore A); otherwise,
set q0 to 1.
Figure 2
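The three steps of restoring division can be sketched as below. This is an illustration only: `restoring_divide` is a hypothetical name, and a signed Python integer stands in for the extra sign-bit position of register A in place of explicit (n+1)-bit 2's-complement arithmetic.

```python
def restoring_divide(dividend, divisor, n):
    """Restoring division of an unsigned n-bit dividend by a positive n-bit divisor.

    Returns (quotient, remainder), i.e. the final contents of Q and A.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be positive")
    mask = (1 << n) - 1
    a, q, m = 0, dividend & mask, divisor
    for _ in range(n):
        # Step 1: shift A and Q left one bit position (MSB of Q enters A).
        a = (a << 1) | ((q >> (n - 1)) & 1)
        q = (q << 1) & mask
        # Step 2: subtract M from A.
        a -= m
        # Step 3: a negative A means the sign bit is 1.
        if a < 0:
            a += m          # restore A; q0 stays 0
        else:
            q |= 1          # set q0 to 1
    return q, a
```

For example, `restoring_divide(8, 3, 4)` returns quotient 2 and remainder 2.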
PIPELINING
Pipelining is a technique of decomposing a sequential process into suboperations, with each
subprocess being executed in a special dedicated segment that operates concurrently with
all other segments.
Pipeline Organization
The simplest way of viewing the pipeline structure is to imagine that each segment
consists of an input register followed by a combinational circuit. The register holds the data
and the combinational circuit performs the sub operation in the particular segment. The
output of the combinational circuit is applied to the input register of the next segment. A
clock is applied to all registers after enough time has elapsed to perform all segment
activity. In this way the information flows through the pipeline one step at a time.
Example demonstrating the pipeline organization
Suppose we want to perform the combined multiply and add operations with a stream of
numbers:
Ai*Bi + Ci for i = 1, 2, 3, ..., 7
Each suboperation is to be implemented in a segment within a pipeline. Each segment has one
or two registers and a combinational circuit, as shown in the figure.
R1 through R5 are registers that receive new data with every clock pulse.
The multiplier and adder are combinational circuits. The suboperations performed
in each segment of the pipeline are as follows:
R1 <- Ai, R2 <- Bi           Input Ai and Bi
R3 <- R1*R2, R4 <- Ci        Multiply and input Ci
R5 <- R3 + R4                Add Ci to product
The five registers are loaded with new data every clock pulse.
The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse
transfers the product of R1 and R2 into R3 and C1 into R4. The same clock pulse transfers
A2 and B2 into R1 and R2. The third clock pulse operates on all three segments
simultaneously. It places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into
R3, transfers C2 into R4, and places the sum of R3 and R4 into R5. It takes three clock
pulses to fill up the pipe and retrieve the first output from R5. From there on, each clock
produces a new output and moves the data one step down the pipeline. This happens as long
as new input data flow into the system.
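The clock-by-clock behaviour described above can be traced with a small simulation. This is a sketch with assumed names (`run_pipeline`, plain Python lists for the A, B and C streams); `None` marks a register that does not yet hold valid data.

```python
def run_pipeline(a, b, c):
    """Simulate the three-segment pipeline computing a[i]*b[i] + c[i].

    Returns a list of (clock pulse, value loaded into R5).
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    outputs = []
    for pulse in range(1, n + 3):            # n + 2 pulses empty the pipe
        # All registers load on the same pulse, so the new values are
        # computed from the old register contents first.
        new_r5 = r3 + r4 if r3 is not None else None           # segment 3
        new_r3 = r1 * r2 if r1 is not None else None           # segment 2
        new_r4 = c[pulse - 2] if r1 is not None else None
        if pulse <= n:                                         # segment 1
            new_r1, new_r2 = a[pulse - 1], b[pulse - 1]
        else:
            new_r1 = new_r2 = None
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        if r5 is not None:
            outputs.append((pulse, r5))
    return outputs
```

With seven input triples, the first result appears in R5 on pulse 3 and one further result follows on each pulse up to pulse 9.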
segment 2 is busy with T1, while segment 1 is busy with task T2. Continuing in this manner,
the first task T1 is completed after the fourth clock cycle. From then on, the pipe completes
a task every clock cycle.
Consider the case where a k-segment pipeline with a clock cycle time tp is used to
execute n tasks. The first task T1 requires a time equal to k tp to complete its operation, since
there are k segments in the pipe. The remaining n-1 tasks emerge from the pipe at the rate of
one task per clock cycle, and they will be completed after a time equal to (n-1) tp. Therefore,
completing n tasks in a k-segment pipeline requires k + (n-1) clock cycles, a total time of
(k + n - 1) tp.
Consider a nonpipeline unit that performs the same operation and takes a time equal to
tn to complete each task. The total time required for n tasks is n tn. The speedup of pipeline
processing over equivalent nonpipeline processing is defined by the ratio
S = n tn / ((k + n - 1) tp)
As the number of tasks increases, n becomes much larger than k-1, and k+n-1 approaches the
value of n. Under this condition the speedup ratio becomes
S = tn / tp
If we assume that the time it takes to process a task is the same in the pipeline and
nonpipeline circuits, we will have tn = k tp. With this assumption, the speedup ratio reduces to
S = k tp / tp = k
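The speedup ratio can be evaluated numerically to see how it approaches k. The function name `pipeline_speedup` and the sample values used with it are assumptions chosen for illustration, not figures from the notes.

```python
def pipeline_speedup(n, k, tp, tn=None):
    """Speedup S = n*tn / ((k + n - 1)*tp) of a k-segment pipeline over n tasks.

    If tn is not given, the assumption tn = k*tp is used.
    """
    if tn is None:
        tn = k * tp
    return (n * tn) / ((k + n - 1) * tp)
```

With hypothetical values k = 4, tp = 20 and n = 100, `pipeline_speedup(100, 4, 20)` is 8000/2060, about 3.88, and the ratio moves closer to k = 4 as n grows.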
ARITHMETIC PIPELINES
A and B are two fractions that represent the mantissas, and a and b are the exponents. The
floating-point addition and subtraction can be performed in four segments. The registers
labeled R are placed between the segments to store intermediate results. The suboperations
that are performed in the four segments are:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
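A software sketch of the four segments is given below. It assumes the usual suboperations for floating-point addition (compare exponents, align mantissas, add mantissas, normalize), uses base 2, and represents mantissas as exact fractions; the function name `fp_add` and this representation are assumptions made for the example, not the hardware design.

```python
from fractions import Fraction


def fp_add(A, a, B, b):
    """Add A * 2**a and B * 2**b, where A and B are fractional mantissas.

    Returns (mantissa, exponent) with 1/2 <= |mantissa| < 1, or (0, 0).
    """
    # Segment 1: compare the exponents and keep the larger one.
    diff = a - b
    exp = max(a, b)
    # Segment 2: align the mantissa that has the smaller exponent.
    if diff > 0:
        B = B / 2 ** diff
    elif diff < 0:
        A = A / 2 ** (-diff)
    # Segment 3: add the mantissas (subtraction is addition of a negative).
    total = A + B
    # Segment 4: normalize the result.
    if total == 0:
        return Fraction(0), 0
    while abs(total) >= 1:
        total /= 2
        exp += 1
    while abs(total) < Fraction(1, 2):
        total *= 2
        exp -= 1
    return total, exp
```

For example, adding 3/4 x 2^3 and 1/2 x 2^2 aligns the second mantissa to 1/4, produces the sum 1 x 2^3, and normalizes it to 1/2 x 2^4.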
INSTRUCTION PIPELINE
An instruction pipeline operates on a stream of instructions by overlapping the
fetch, decode, and execute phases of the instruction cycle. An instruction pipeline reads
consecutive instructions from memory while previous instructions are being executed in
other segments. This causes the instruction fetch and execute phases to overlap and perform
simultaneous operations.
Consider a computer with an instruction fetch unit and an instruction execute unit
designed to provide a two-segment pipeline. The instruction fetch segment can be
implemented by means of a first-in, first-out (FIFO) buffer. Whenever the execution unit is
not using memory, the control increments the program counter and uses its address value to
read consecutive instructions from memory. The instructions are inserted into the FIFO buffer
so that they can be executed on a first-in, first-out basis. Thus an instruction stream can be
placed in a queue, waiting for decoding and processing by the execution segment.
In general the computer needs to process each instruction with the following sequence of
steps.
The figure shows how the instruction cycle in the CPU can be processed with a four-segment
pipeline. While an instruction is being executed in segment 4, the next instruction in sequence
is busy fetching an operand from memory in segment 3. The effective address may be calculated
in a separate arithmetic circuit for the third instruction, and whenever the memory is
available, the fourth and all subsequent instructions are placed in an instruction FIFO.
The figure shows the operation of the instruction pipeline. The time on the horizontal axis is
divided into steps of equal duration. The four segments are represented in the diagram with
abbreviated symbols.
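1. FI is the segment that fetches an instruction.
2. DA is the segment that decodes the instruction and calculates the effective address.
3. FO is the segment that fetches the operand.
4. EX is the segment that executes the instruction.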
It is assumed that the processor has separate instruction and data memories so that the
operations in FI and FO can proceed at the same time. In the absence of a branch instruction,
each segment operates on a different instruction. Thus, in step 4, instruction 1 is being
executed in segment EX; the operand for instruction 2 is being fetched in segment FO;
instruction 3 is being decoded in segment DA; and instruction 4 is being fetched from
memory in segment FI.
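The occupancy of the four segments at any step, in the absence of branches, can be computed directly. The sketch below assumes that instruction i enters segment FI at step i and advances one segment per step; the name `segment_contents` is hypothetical.

```python
SEGMENTS = ["FI", "DA", "FO", "EX"]


def segment_contents(step, n_instructions):
    """Return {segment: instruction number} for one step, with no branches."""
    contents = {}
    for position, segment in enumerate(SEGMENTS):
        instruction = step - position      # instruction i is in FI at step i
        if 1 <= instruction <= n_instructions:
            contents[segment] = instruction
    return contents
```

Calling `segment_contents(4, 7)` gives instruction 4 in FI, 3 in DA, 2 in FO and 1 in EX, matching the description of step 4.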
Assume now that instruction 3 is a branch instruction. As soon as this instruction is decoded
in segment DA in step 4, the transfer from FI to DA of the other instructions is halted until
the branch instruction is executed in step 6.
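PIPELINE CONFLICTS:
1. RESOURCE CONFLICTS: They are caused by access to memory by two segments
at the same time. Most of these conflicts can be resolved by using separate instruction
and data memories.
2. DATA DEPENDENCY: These conflicts arise when an instruction depends on the result of a
previous instruction, but this result is not yet available.
3. BRANCH DIFFICULTIES: They arise from branch and other instructions that change
the value of PC.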
DATA HAZARDS
We must ensure that the results obtained when instructions are executed in a pipelined
processor are identical to those obtained when the same instructions are executed
sequentially.
Hazard occurs (the second operation needs the new value of A):
A ← 3 + A
B ← 4 × A
No hazard (the two operations are independent):
A ← 5 × C
B ← 20 + C
When two operations depend on each other, they must be executed sequentially in the
correct order.
Another example:
Mul R2, R3, R4
Add R5, R4, R6
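A dependency of this kind can be detected mechanically by comparing register names. The sketch below assumes a three-operand format in which the last operand is the destination, so that Mul writes R4 and Add reads R4; that operand order and the name `has_data_hazard` are assumptions made for the example.

```python
def has_data_hazard(first, second):
    """True if `second` reads the register that `first` writes.

    Each instruction is a tuple (opcode, source1, source2, destination).
    """
    destination = first[-1]
    sources = second[1:-1]
    return destination in sources


mul = ("Mul", "R2", "R3", "R4")
add = ("Add", "R5", "R4", "R6")
```

Here `has_data_hazard(mul, add)` is true because the Add instruction uses R4, the result of the Mul instruction, as a source operand.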
Operand Forwarding
Instead of from the register file, the second instruction can get data directly from the
output of the ALU.
A special arrangement needs to be made to “forward” the output of the ALU to the input of
the ALU.
[Figure 8.7. Operand forwarding: (a) Datapath, showing Source 1, Source 2, the register file,
the SRC1, SRC2 and RSLT registers, the ALU and the forwarding path; (b) Position of the source
and result registers in the processor pipeline, with stages E: Execute (ALU) and W: Write
(register file)]
SIDE EFFECTS
The previous example is explicit and easily detected.
Sometimes an instruction changes the contents of a register other than the one named as
the destination.
When a location other than one explicitly named in an instruction as a destination
operand is affected, the instruction is said to have a side effect.
Example: condition code flags:
Add R1, R3
AddWithCarry R2, R4
Instructions designed for execution on pipelined hardware should have few side effects.
INSTRUCTION HAZARDS
Whenever the stream of instructions supplied by the instruction fetch unit is interrupted,
the pipeline stalls.
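• Cache miss
• Branch
[Figure: pipeline timing over clock cycles 1 to 6; instruction I3 is fetched (F3) and then
discarded (X), and instruction Ik is fetched (Fk) and executed (Ek)]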
Unconditional Branches
Branch Timing
Branch penalty
Reducing the penalty
Clock cycle    1   2   3   4   5   6   7   8
I1             F1  D1  E1  W1
I2 (Branch)        F2  D2  E2
I3                     F3  D3  X
I4                         F4  X
Ik                             Fk  Dk  Ek  Wk

Clock cycle    1   2   3   4   5   6   7
I1             F1  D1  E1  W1
I3                     F3  X
Ik                         Fk  Dk  Ek  Wk
Conditional Branches
A conditional branch instruction introduces the added hazard caused by the dependency
of the branch condition on the result of a preceding instruction.
The decision to branch cannot be made until the execution of that instruction has been
completed.
Branch instructions represent about 20% of the dynamic instruction count of most
programs.
Delayed Branch
The instructions in the delay slots are always fetched. Therefore, we would like to
arrange for them to be fully executed whether or not the branch is taken.
The objective is to place useful instructions in these slots.
The effectiveness of the delayed branch approach depends on how often it is possible to
reorder instructions.
Branch Prediction
The aim is to predict whether or not a particular branch will be taken.
Simplest form: assume branch will not take place and continue to fetch instructions in
sequential address order.
Until the branch is evaluated, instruction execution along the predicted path must be done
on a speculative basis.
Speculative execution: instructions are executed before the processor is certain that they
are in the correct execution sequence.
Need to be careful so that no processor registers or memory locations are updated until it
is confirmed that these instructions should indeed be executed.
Better performance can be achieved if we arrange for some branch instructions to be
predicted as taken and others as not taken.
Use hardware to observe whether the target address is lower or higher than that of the
branch instruction.
Let compiler include a branch prediction bit.
So far the branch prediction decision is always the same every time a given instruction is
executed – static branch prediction.
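The two static schemes mentioned above can be sketched as a single decision function. The rule used here, predicting a branch as taken when its target address is lower than the branch address (as at the end of a loop), is a common heuristic and an assumption of this example; the name `predict_taken` and the `hint_bit` parameter are hypothetical.

```python
def predict_taken(branch_addr, target_addr, hint_bit=None):
    """Static branch prediction.

    If the compiler supplied a prediction bit, use it. Otherwise predict
    taken for backward branches and not taken for forward branches.
    """
    if hint_bit is not None:
        return bool(hint_bit)
    return target_addr < branch_addr
```

The result depends only on the instruction itself, so the prediction is the same every time a given branch is executed.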