CSE-4821 Dr. M. A. Rouf, Dept. of CSE, DUET, Gazipur
CLASSES OF
COMPUTERS
Personal Mobile Device (PMD)
• e.g. smart phones, tablet computers
• Emphasis on energy efficiency and real-time for media apps
Desktop Computing
• Emphasis on price-performance
CLASSES OF
COMPUTERS (CONTD..)
Servers
• Emphasis on availability, scalability, throughput
Clusters / Warehouse Scale Computers
• Used for “Software as a Service (SaaS)”
• Emphasis on availability and price-performance
• Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks
Embedded Computers
• Microwaves, washing machines, printers, networking switches
• Emphasis: price
CURRENT TRENDS
Cannot continue to exploit Instruction-Level parallelism (ILP)
• Single processor performance improvement ended in 2003
PARALLELISM
Classes of parallelism in applications:
• Data-Level Parallelism (DLP)
• Task-Level Parallelism (TLP)
Classes of architectural parallelism:
• Instruction-Level Parallelism (ILP)
• Exploit DLP
• Vector architectures/Graphic Processor Units (GPUs)
• Exploit DLP
• Thread-Level Parallelism
• Exploit DLP or TLP
• Request-Level Parallelism
• Exploit TLP
LAYER OF SYSTEM ARCHITECTURE
DEFINING COMPUTER
ARCHITECTURE
The task of the computer designer:
Determine what attributes are important for a new
computer, then design a computer to maximize performance
while staying within cost, power, and availability constraints
DEFINING COMPUTER
ARCHITECTURE
This task has many aspects:
• Instruction set design
• Functional organization
• Logic design
• And implementation
Also,
• Integrated circuit design
• Packaging
• Power
• Cooling
AND
• Optimization, spanning many technologies (compiler, OS, …)
INSTRUCTION SET
ARCHITECTURE (ISA)
The instruction set architecture
serves as the boundary between
the software and hardware.
TRENDS IN
TECHNOLOGY
To evaluate a computer, a designer must
be aware of rapid changes in
implementation technology
• Integrated circuit logic:
• Transistor density increases by about 35% per year
• Die size increases by 10% to 20% per year
• The combined effect is a growth rate in transistor
count on a chip of about 40% to 55% per year
TRENDS IN
TECHNOLOGY
• DRAM (dynamic random-access memory):
• Capacity increases by about 40% per year, doubling
roughly every two years
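The link between a 40% annual growth rate and "doubling roughly every two years" can be checked with compound growth. The sketch below is only an illustration; the function names are hypothetical and a constant growth rate is assumed.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def growth_factor(annual_growth: float, years: float) -> float:
    """Total growth factor after the given number of years."""
    return (1 + annual_growth) ** years


# 40% per year is the DRAM capacity figure quoted on the slide.
dram_doubling = doubling_time_years(0.40)
two_year_factor = growth_factor(0.40, 2)

print(f"Doubling time at 40%/year: {dram_doubling:.2f} years")
print(f"Growth factor after 2 years: {two_year_factor:.2f}x")
```

After two years the capacity has grown by a factor of 1.4 × 1.4 = 1.96, which is why the slide rounds it to a doubling.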
CSE 4821
Advanced Computer Architecture
LECTURE 3
EVOLUTION OF INSTRUCTION SETS
Single Accumulator (EDSAC 1950, Maurice Wilkes)
Accumulator + Index Registers
(Manchester Mark I, IBM 700 series 1953)
CISC: Intel x86, Pentium
RISC: MIPS, SPARC, HP-PA, IBM RS6000, PowerPC, . . . (1987)
CLASSIFYING ISAS
Accumulator (before 1960, e.g. 68HC11):
1-address add A acc ← acc + mem[A]
OPERAND LOCATIONS IN FOUR ISA CLASSES
GPR
WORD-ORIENTED MEMORY
ORGANIZATION
Memory is byte addressed and provides access for bytes (8 bits), half words (16 bits), words (32 bits), and double words (64 bits).
Addresses Specify Byte Locations
• Address of first byte in word
• Addresses of successive words differ by 4 (32-bit) or 8 (64-bit)
[Figure: bytes at addresses 0000 to 0015, grouped into 32-bit words (Addr = 0000, 0004, 0008, 0012) and 64-bit words (Addr = 0000, 0008)]
BYTE ORDERING
How should bytes within multi-byte word be ordered in memory?
Conventions
• Suns, Macs are “Big Endian” machines
• Least significant byte has highest address
• Alphas, PCs are “Little Endian” machines
• Least significant byte has lowest address
BYTE ORDERING EXAMPLE
Big Endian
• Least significant byte has highest address
Little Endian
• Least significant byte has lowest address
Example
• Variable x has 4-byte representation 0x01234567
• Address given by &x is 0x100
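The example above can be laid out byte by byte. This is a minimal sketch using Python's built-in `int.to_bytes`; the helper name `layout` is hypothetical and only serves the illustration.

```python
def layout(value: int, base_addr: int, byteorder: str, size: int = 4) -> dict:
    """Map each byte address to the byte stored there for the given ordering."""
    data = value.to_bytes(size, byteorder)
    return {base_addr + i: byte for i, byte in enumerate(data)}


x = 0x01234567   # 4-byte value from the slide
base = 0x100     # address given by &x

big = layout(x, base, "big")
little = layout(x, base, "little")

for name, mem in (("Big Endian", big), ("Little Endian", little)):
    cells = "  ".join(f"{addr:#x}: {byte:02x}" for addr, byte in mem.items())
    print(f"{name:14s} {cells}")
```

In the big-endian layout the least significant byte 0x67 sits at the highest address 0x103; in the little-endian layout it sits at the lowest address 0x100.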
TYPES OF OPERATIONS
Arithmetic and Logic: AND, ADD
Data Transfer: MOVE, LOAD, STORE
Control: BRANCH, JUMP, CALL
System: OS CALL, VM
Floating Point: ADDF, MULF, DIVF
Decimal: ADDD, CONVERT
String: MOVE, COMPARE
Graphics: (DE)COMPRESS
TOP 10 80X86 INSTRUCTIONS
° Rank, instruction, integer average (percent of total executed)
1 load 22%
2 conditional branch 20%
3 compare 16%
4 store 12%
5 add 8%
6 and 6%
7 sub 5%
8 move register-register 4%
9 call 1%
10 return 1%
Total 96%
° Simple instructions dominate instruction frequency
RELATIVE FREQUENCY OF
CONTROL INSTRUCTIONS
THE MIPS INSTRUCTION FORMATS
All MIPS instructions are 32 bits long. The three instruction formats:
• R-type: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)
• I-type: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
• J-type: op (bits 31-26, 6 bits) | target address (25-0, 26 bits)
The different fields are:
• op: operation of the instruction
• rs, rt, rd: the source and destination register specifiers
• shamt: shift amount
• funct: selects the variant of the operation in the “op” field
• address / immediate: address offset or immediate value
• target address: target address of the jump instruction
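As a rough illustration of the field layout, the sketch below packs R-type and I-type fields into a 32-bit word. The function names are hypothetical. The register numbers ($t0 = 8, $s1 = 17, $s2 = 18, $s3 = 19), the `add` funct code 0x20, and the `lw` opcode 0x23 are not given on the slides; they are assumed from the standard MIPS conventions.

```python
def encode_r_type(rs: int, rt: int, rd: int, shamt: int, funct: int, op: int = 0) -> int:
    """Pack the six R-type fields (6/5/5/5/5/6 bits) into one 32-bit word."""
    fields = ((op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6))
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def encode_i_type(op: int, rs: int, rt: int, immediate: int) -> int:
    """Pack the I-type fields (6/5/5/16 bits); the immediate is kept as 16-bit two's complement."""
    if not -(1 << 15) <= immediate < (1 << 15):
        raise ValueError("immediate does not fit in 16 bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


# add $t0, $s1, $s2  ->  rd = 8, rs = 17, rt = 18, funct = 0x20 (assumed values)
add_word = encode_r_type(rs=17, rt=18, rd=8, shamt=0, funct=0x20)

# lw $t0, 32($s3)  ->  op = 0x23, rs = 19, rt = 8, immediate = 32 (assumed values)
lw_word = encode_i_type(op=0x23, rs=19, rt=8, immediate=32)

print(f"add $t0,$s1,$s2 = {add_word:#010x}")
print(f"lw  $t0,32($s3) = {lw_word:#010x}")
```

Because every instruction is exactly 32 bits and the op, rs, and rt fields sit in the same bit positions in both formats, decoding can start before the format is fully known.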
MIPS ADDRESSING MODES/INSTRUCTION FORMATS
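• Register (direct): op rs rt rd; the operand is in a register
• Immediate: op rs rt immed; the operand is the immed field
• Displacement: op rs rt immed; the operand is in memory at address register + immed
• PC-relative: op rs rt immed; the address in memory is PC + immed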
REVIEW: 5-STAGE
EXECUTION
5 canonical stage “RISC” load-store architecture
1. Instruction fetch (IF):
• get instruction from memory/cache
2. Instruction decode, Register read (ID):
• translate opcode into control signals and read regs
3. Execute (EX):
• perform ALU operation, load/store address, branch outcomes
4. Memory (MEM):
• access memory if load/store, everyone else idle
5. Writeback/retire (WB):
• write results to register file
SOLUTION
Overlap execution of instructions
• Start an instruction on every cycle, e.g. the new instruction can be fetched while the
previous one is decoded. This is a pipeline. Each stage performs a specific task in each cycle; the number of
stages is called the pipeline depth (5 here)
Non-pipelined
time 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
IF ID EX MEM WB
IF ID EX MEM WB
IF ID EX MEM WB
Pipelined
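The cycle counts behind the comparison above can be sketched as follows, assuming an ideal pipeline with no stalls and one cycle per stage (the function names are hypothetical).

```python
def cycles_non_pipelined(instructions: int, stages: int = 5) -> int:
    """Each instruction runs all stages before the next one starts."""
    return instructions * stages


def cycles_pipelined(instructions: int, stages: int = 5) -> int:
    """The first instruction takes `stages` cycles; each later one finishes one cycle after."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


# Three instructions, as in the slide's timing diagram (time axis 1 to 15).
seq = cycles_non_pipelined(3)
pipe = cycles_pipelined(3)
print(f"3 instructions: {seq} cycles non-pipelined, {pipe} cycles pipelined")

# For long instruction streams the speedup approaches the pipeline depth.
speedup = cycles_non_pipelined(1000) / cycles_pipelined(1000)
print(f"1000 instructions: speedup = {speedup:.2f}")
```

Hazards, discussed later in these slides, add stall cycles on top of the ideal pipelined count.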
Pipeline progress: an instruction moves with all its control signals, addresses, and data items, so the pipeline registers have different lengths at different stages.
[Figure: pipelined datapath with PC, instruction memory, register file (R0 to R7), ALU, data memory, and multiplexers]
DATA HAZARD - STALLING
[Figure: timing diagram, time 0 to 18. add $s0,$t0,$t1 writes $s0 in its WB stage; two stall cycles (bubbles) are inserted so that sub $t2,$s0,$t3 reads $s0 in its ID stage only after it has been written.]
DATA HAZARDS
Two different instructions use the same storage
location
• It must appear as if they executed in sequential order
CONTROL HAZARD ON BRANCHES: THREE STAGE STALL
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[Figure: each instruction passes through Ifetch, Reg, ALU, DMem, Reg; the three instructions after the beq enter the pipeline before the branch target at 36 is fetched.]
CONTROL HAZARDS
Branch problem:
• Branches are resolved in the EX stage
• 3-cycle penalty on taken branches
Ideal CPI = 1. Assuming a 3-cycle penalty for all branches and 32% branch
instructions, new CPI = 1 + 0.32 × 3 = 1.96
Solutions:
• Reduce branch penalty: change the datapath – new adder needed
in ID stage.
• Fill branch delay slot(s) with a useful instruction.
• Fixed branch prediction.
• Static branch prediction.
• Dynamic branch prediction.
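The CPI figure above follows from adding the average stall cycles per instruction to the ideal CPI. A minimal sketch (the function name is hypothetical; the 1-cycle case is an assumed illustration of resolving branches earlier, not a number from the slides):

```python
def cpi_with_branch_stalls(ideal_cpi: float, branch_fraction: float, penalty_cycles: float) -> float:
    """Effective CPI = ideal CPI + (fraction of branches) * (stall cycles per branch)."""
    return ideal_cpi + branch_fraction * penalty_cycles


# Numbers from the slide: ideal CPI 1, 32% branches, 3-cycle penalty.
cpi_three_cycle = cpi_with_branch_stalls(1.0, 0.32, 3)

# Assumed illustration: if the penalty were cut to 1 cycle.
cpi_one_cycle = cpi_with_branch_stalls(1.0, 0.32, 1)

print(f"3-cycle penalty: CPI = {cpi_three_cycle:.2f}")
print(f"1-cycle penalty: CPI = {cpi_one_cycle:.2f}")
```

This is why reducing the branch penalty, or avoiding it through prediction, has such a large effect on throughput.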
Pipeline: Hazards
Dr. M. A. Rouf
Professor
Dept. of CSE
Pipelining Outline
• Introduction
– Defining Pipelining
– Pipelining Instructions
• Hazards
– Structural hazards
– Data Hazards
– Control Hazards
• Performance
• Controller implementation
Pipeline Hazards
• Where one instruction cannot immediately follow
another
• Types of hazards
– Structural hazards - attempt to use the same resource
by two or more instructions
– Control hazards - attempt to make branching decisions
before branch condition is evaluated
– Data hazards - attempt to use data before it is ready
• Can always resolve hazards by waiting
Executing Multiple Instructions
[Figure: at clock cycle 3 the pipeline holds ADD, SW, and LW in successive stages.]
[Figure: pipeline diagram over CC 1 to CC 8 for Load, Instr 1, Instr 2, and Instr 3, each passing through Ifetch, Reg, ALU, DMem, Reg. Memory conflict: the instruction fetch of Instr 3 needs memory in the same cycle as the data access of the Load, so a stall (bubble) delays Instr 3.]
[Figure: pipeline diagram over CC 1 to CC 9 for sub $2, $1, $3 and the instructions that follow it in program execution order. Register $2 holds 10 through CC 4, changes from 10 to –20 during CC 5, and holds –20 from CC 6 on.]
The use of the result of the SUB instruction in the next three instructions causes a data
hazard, since the register $2 is not written until after those instructions read it.
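Read After Write (RAW): instruction J reads an operand before instruction I writes it.
I: add r1,r2,r3
J: sub r4,r1,r3
Write After Read (WAR): instruction J writes an operand before instruction I reads it.
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
– Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
Write After Write (WAW): instruction J writes an operand before instruction I writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
– Called an “output dependence” by compiler writers.
This also results from the reuse of name “r1”.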
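The three hazard types can be told apart by comparing the destination and source registers of two instructions in program order. The sketch below is only an illustration; the `Instr` type and the `hazards` function are hypothetical names, and only register operands are considered.

```python
from typing import FrozenSet, NamedTuple, Set


class Instr(NamedTuple):
    text: str
    dest: str
    srcs: FrozenSet[str]


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Classify register hazards between two instructions, `first` preceding `second`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")   # second overwrites what first still has to read
    if first.dest == second.dest:
        found.add("WAW")   # both write the same register
    return found


raw_pair = (Instr("add r1,r2,r3", "r1", frozenset({"r2", "r3"})),
            Instr("sub r4,r1,r3", "r4", frozenset({"r1", "r3"})))
war_pair = (Instr("sub r4,r1,r3", "r4", frozenset({"r1", "r3"})),
            Instr("add r1,r2,r3", "r1", frozenset({"r2", "r3"})))
waw_pair = (Instr("sub r1,r4,r3", "r1", frozenset({"r4", "r3"})),
            Instr("add r1,r2,r3", "r1", frozenset({"r2", "r3"})))

for first, second in (raw_pair, war_pair, waw_pair):
    print(f"{first.text} ; {second.text} -> {sorted(hazards(first, second))}")
```

Whether a detected dependence actually causes a stall depends on the pipeline; in the simple in-order 5-stage pipeline only RAW does.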
[Figure: the same pipeline diagram for sub $2, $1, $3 over CC 1 to CC 9, with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB marked.]
[Figure: load-use hazard. lw $s0,20($t1) produces the new value of $s0 late in the pipeline, so one stall (bubble) is inserted before sub $t2,$s0,$t3 reads $s0.]
[Figure: pipeline diagram for sub $2, $1, $3 and the following instructions in program execution order, with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB marked.]
Assumption:
• The register file forwards values that are read and
written during the same cycle.
Data Hazard Summary
• Three types of data hazards
– RAW (MIPS)
– WAW (not in MIPS)
– WAR (not in MIPS)
• Solution to RAW in MIPS
– Stall
– Forwarding
• Detection & Control
– EX hazard
– MEM hazard
• A stall is needed if an instruction reads a register right after a load
instruction that writes the same register.
– Reordering
Pipelining Outline
• Introduction
– Defining Pipelining
– Pipelining Instructions
• Hazards
– Structural hazards
– Data Hazards
– Control Hazards
• Performance
• Controller implementation
Data Hazard Review
• Three types of data hazards
– RAW (in MIPS and all others)
– WAW (not in MIPS but many others)
– WAR (not in MIPS but many others)
• Forwarding
A branch is either
– Taken: PC <= PC + 4 + Immediate
– Not Taken: PC <= PC + 4
[Figure: the beq at address 10 is followed into the pipeline by the instructions at 14, 18, and 22 before the target at 36 (xor r10,r1,r11) is fetched. With stalling, the instruction after the beq (sw $s4,200($t5)) is held back by a bubble until the beq has written the new PC.]
Control Hazard - Correct Prediction
[Figure: timing diagram, time 0 to 18. The fetch assumes the branch is taken, so the target instruction (tgt: sw $s4,200($t5)) enters the pipeline right after the branch with no stall.]
[Figure: incorrect prediction. The already fetched target instruction (tgt: sw $s4,200($t5)) is “squashed” into bubbles, and or $r8,$r8,$r9 is fetched and executed instead.]
INSTRUCTION
LEVEL
PARALLELISM
LECTURE 3
ILP CHALLENGES (CONTROL FLOW
GRAPH)
DEFINITION OF ILP
ILP=Potential overlap of execution among instructions.
Overlapping possible if:
• No Structural Hazards
• No RAW, WAR or WAW Stalls
• No Control Stalls
HARDWARE SCHEMES
TO EXPLOIT ILP
Why?
• Works when can’t know real dependence at compile time
• Compiler Simpler
• Code for one machine runs well on another
KEY IDEA:
• Allow instructions behind stall to
proceed
• Enables out-of-order execution and
completion (commit).
• First implemented in CDC 6600
(1963).
EXAMPLE:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
DEPENDENCE
• Data Dependence: True dependence
• Name dependence: WAR or WAW
• Control Dependence
• Data Dependency: instruction j is data dependent on instruction i if either
• Instruction i produces a result that may be used by
instruction j, or
• Instruction j is data dependent on instruction k and k is
data dependent on instruction i (transitive dependency)
DEPENDENCE
Data Dependence
NAME DEPENDENCE
• Name dependence:
• It occurs when two or more instructions use the
same register or memory location, called a name,
but there is no flow of data between the instructions
associated with that name.
• Two types of name dependence
• Anti dependence: Write After Read (WAR)
• Output dependence: Write After Write (WAW)
SCOREBOARD SCHEME
• ID stage split into two parts:
• Issue (decode and check structural
hazards).
• Read Operands (wait until no data
hazards).
• Scoreboard allows instructions without
dependencies to execute.
SCOREBOARD IMPLICATIONS
• Out-of-order completion -> WAR and
WAW hazards.
• Solutions for WAR:
• Queue both the operation and copies of its
operands.
• Read registers only during Read Operands stage.
SCOREBOARD IMPLICATIONS
• For WAW, the machine stalls until the
other instruction completes
• Multiple execution units
• Scoreboard keeps track of dependencies
and state of operations.
FOUR STAGES OF
SCOREBOARD CONTROL
1. Issue
• Decode instructions & check for structural hazards.
• If a functional unit for the instruction is free and no other
active instruction has the same destination register
(WAW), the scoreboard issues the instruction to the
functional unit and updates its internal data structure.
• If a structural or a WAW hazard exists, then the instruction
issue stalls, and no further instructions will issue until these
hazards are cleared.
FOUR STAGES OF
SCOREBOARD CONTROL
2. Read Operands
• Wait until no data hazards, then read operands
• A source operand is available if:
- no earlier issued active instruction will write it or
- A functional unit is writing its value in a register
• When the source operands are available, the scoreboard
tells the functional unit to proceed to read the operands
from the registers and begin execution.
• RAW hazards are resolved dynamically in this step, and
instructions may be sent into execution out of order.
FOUR STAGES OF
SCOREBOARD CONTROL
3. Execution
• Operate on operands
• The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the
scoreboard that it has completed execution.
FOUR STAGES OF
SCOREBOARD CONTROL
4. Write result
• Finish execution
• Once the scoreboard is aware that the
unit has completed execution, the
scoreboard checks for WAR hazards.
• If none, it writes results.
• If WAR, then it stalls the instruction.
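The checks performed in the Issue, Read Operands, and Write Result stages can be sketched as three small predicates over the scoreboard tables. This is a simplified illustration, not the CDC 6600 logic itself: the class and function names are hypothetical, the fields follow the Busy/Op/Fi/Fj/Fk/Qj/Qk/Rj/Rk naming used in these slides, and Rj/Rk are assumed to be cleared once a unit has read its operands.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FunctionalUnit:
    busy: bool = False
    op: str = ""
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    qj: Optional[str] = None   # unit producing fj, if still pending
    qk: Optional[str] = None   # unit producing fk, if still pending
    rj: bool = False           # fj ready and not yet read
    rk: bool = False           # fk ready and not yet read


def can_issue(unit: FunctionalUnit, dest: str, result_status: Dict[str, str]) -> bool:
    """Issue: the unit must be free (no structural hazard) and no active
    instruction may already be writing `dest` (no WAW hazard)."""
    return not unit.busy and result_status.get(dest) is None


def can_read_operands(unit: FunctionalUnit) -> bool:
    """Read Operands: both sources must be available (RAW hazards resolved)."""
    return unit.rj and unit.rk


def can_write_result(name: str, units: Dict[str, FunctionalUnit]) -> bool:
    """Write Result: stall while another unit still has to read the old value
    of this unit's destination register (WAR hazard)."""
    writer = units[name]
    for other_name, other in units.items():
        if other_name == name or not other.busy:
            continue
        if (other.fj == writer.fi and other.rj) or (other.fk == writer.fi and other.rk):
            return False
    return True


# State loosely modelled on the example: MULTD and ADDD have read their
# operands, DIVD still waits for F0 from Mult1 and has not yet read F6.
units = {
    "Integer": FunctionalUnit(),
    "Mult1": FunctionalUnit(True, "Mult", "F0", "F2", "F4"),
    "Add": FunctionalUnit(True, "Add", "F6", "F8", "F2"),
    "Divide": FunctionalUnit(True, "Div", "F10", "F0", "F6", qj="Mult1", rj=False, rk=True),
}
result_status = {"F0": "Mult1", "F6": "Add", "F10": "Divide"}

print("ADDD may write F6:", can_write_result("Add", units))
print("DIVD may read operands:", can_read_operands(units["Divide"]))
print("New load into F2 may issue:", can_issue(units["Integer"], "F2", result_status))
```

In this state the ADDD result is held back because DIVD has not yet read F6, which is the WAR stall described above.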
WAR EXAMPLE
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
In this case, the scoreboard would stall the SUBD
in the Write Result stage until ADDD has read F0 and F8.
SCOREBOARD STRUCTURE
1. Instruction status
2. Functional Unit status
Indicates the state of the functional unit (FU):
Busy – Indicates whether the unit is busy or not
Op - The operation to perform in the unit (+,-, etc.)
Fi - Destination register
Fj, Fk – Source register numbers
Qj, Qk – Functional units producing source registers
Rj, Rk – Flags indicating when Fj, Fk are ready
3. Register result status.
Indicates which functional unit will write each register.
Blank if no pending instructions will write that register.
SCOREBOARD
EXAMPLE
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
FU
SCOREBOARD
EXAMPLE CYCLE 1
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Integer
SCOREBOARD
EXAMPLE CYCLE 2
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Integer
SCOREBOARD
EXAMPLE CYCLE 3
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer
• Issue stalls
SCOREBOARD
EXAMPLE CYCLE 4
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Integer
• Issue stalls
SCOREBOARD
EXAMPLE CYCLE 5
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer
• In this cycle the 2nd load is issued.
SCOREBOARD
EXAMPLE CYCLE 6
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6
MULTD F0 F2 F4 6
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 Integer
• MULT is issued but has to wait for F2
SCOREBOARD
EXAMPLE CYCLE 7
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 Integer Add
• Now SUBD can be issued, but it has to wait to read operand F2.
SCOREBOARD
EXAMPLE CYCLE 8A
• DIVD is issued but there is another RAW hazard
SCOREBOARD
EXAMPLE CYCLE 8B
• Load completes, and operands for MULT and SUBD are ready
SCOREBOARD
EXAMPLE CYCLE 9
• MULTD and SUBD are sent to execution in parallel
SCOREBOARD
EXAMPLE CYCLE 11
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
8 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 Add Divide
• The SUBD finishes
SCOREBOARD
EXAMPLE CYCLE 12
• Read operands for DIVD?
SCOREBOARD
EXAMPLE CYCLE 13
• SUBD writes results and ADDD can be issued
SCOREBOARD
EXAMPLE CYCLE 14
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
5 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 Add Divide
SCOREBOARD
EXAMPLE CYCLE 15
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
4 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
1 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 Add Divide
SCOREBOARD
EXAMPLE CYCLE 16
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
3 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU Mult1 Add Divide
SCOREBOARD
EXAMPLE CYCLE 17
• Write result of ADDD? NO, there is a WAR hazard
SCOREBOARD
EXAMPLE CYCLE 18
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
1 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
18 FU Mult1 Add Divide
• Stall, nothing to do
SCOREBOARD
EXAMPLE CYCLE 19
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
0 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
19 FU Mult1 Add Divide
• MULT, execution completed
SCOREBOARD
EXAMPLE CYCLE 20
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Add Divide
• MULTD writes back its result
SCOREBOARD
EXAMPLE CYCLE 21
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide
• DIVD reads operand F0
SCOREBOARD
EXAMPLE CYCLE 22
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
40 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU Divide
• Now DIVD can read its operands, and ADDD can write the result
SCOREBOARD
EXAMPLE CYCLE 61
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide
• DIVD finishes execution
SCOREBOARD
EXAMPLE CYCLE 62
Instruction status
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61 62
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU
CDC 6600 SCOREBOARD
Achieves a speedup of 2.5 w.r.t. no dynamic
scheduling
By reorganizing instructions the compiler
achieves only 1.7
But
• No cache
• No forwarding hardware
• Limited to instructions in a basic block
• Small number of functional units (structural hazards)
• Wait for WAR hazards
• Prevent WAW hazards
BRANCH PREDICTION
Current DLX wastes one cycle but other
architectures compute branches several
cycles after the IF stage.
We need to predict the branch result as early as possible (ID
stage).
Performance of Branch Prediction depends
on:
• Accuracy, measured as the percentage of
mispredictions
• Cost of a misprediction, measured as the time
wasted executing useless instructions.
BRANCH HISTORY TABLE
Table of 1 bit values
Indexed by the lower bits of the PC address
Says whether or not branch taken last time
BRANCH HISTORY TABLE
Problem: in a loop, a 1-bit BHT will cause two mispredictions:
1. When we arrive at the end of the loop and must exit.
Here the BHT predicts staying in the loop.
2. When we re-enter the loop, reach the end of the first iteration, and
must stay in the loop. Here the BHT predicts exiting.
DYNAMIC BRANCH PREDICTION
It is a 2-bit scheme in which we change the prediction only if we
mispredict twice.
For each index of the table, the 2 bits record the state of a
state machine (next slide).
When we arrive at the end of the loop, we don't change the
prediction.
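The difference between the 1-bit and the 2-bit scheme on a loop branch can be seen in a small simulation. This is a sketch under assumptions: the 2-bit scheme is modelled as a saturating counter (states 0 to 3, predict taken when the counter is 2 or 3), which may differ in detail from the state machine on the slide; both predictors start out predicting taken; all names are hypothetical.

```python
from typing import List


def loop_branch_outcomes(iterations: int, runs: int) -> List[bool]:
    """Outcomes of a loop-closing branch: taken (iterations - 1) times, then not taken, per run."""
    return ([True] * (iterations - 1) + [False]) * runs


def mispredictions_1bit(outcomes: List[bool], last_taken: bool = True) -> int:
    """1-bit BHT entry: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken
    return misses


def mispredictions_2bit(outcomes: List[bool], counter: int = 3) -> int:
    """2-bit saturating counter: predict taken when the counter is 2 or 3."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# A 10-iteration loop that is entered three times.
outcomes = loop_branch_outcomes(iterations=10, runs=3)
misses_1bit = mispredictions_1bit(outcomes)
misses_2bit = mispredictions_2bit(outcomes)
print(f"1-bit: {misses_1bit} mispredictions, 2-bit: {misses_2bit} mispredictions")
```

After the first pass, the 1-bit entry misses twice per loop execution (at the exit and at re-entry), while the 2-bit counter misses only at the exit.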
WE CAN DESCRIBE THE
ALGORITHM WITH A FSM
BRANCH HISTORY
TABLE ACCURACY
We have a misprediction when
• We make a wrong guess for that branch
• Because the same index can be referenced by two different
branches, sometimes we get the history of the wrong branch
BRANCH HISTORY
TABLE ACCURACY
It has been measured that, with a 4096-entry
table, programs have a misprediction
rate from 1% to 18%:
• Nasa7, tomcatv 1%
• Eqntott 18%
• Spice 9%
• Gcc 12%
4096 about as good as infinite table (for the
Alpha 21164)
CORRELATING BRANCHES
Basic hypothesis: recent branches are
correlated, i.e., behavior of recently
executed branches affects the prediction of
current branch:
CORRELATING
BRANCHES EXAMPLE
Source code:
If(a==2) bb1;
L1: If(b==2) bb2;
L2: If(a!=b) bb3;
Assembly (a in r1, b in r2):
subi r3,r1,2
bnez r3,L1
add r1,r0,r0 ; bb1
L1: subi r3,r2,2
bnez r3,L2
add r2,r0,r0 ; bb2
L2: sub r3,r1,r2
beqz r3,L3
... ; bb3
L3:
IDEA:
record m most recently executed branches as taken or not
taken. Use that pattern to select the proper branch history
table.
EXAMPLE OF A SIMPLE
CORRELATING
PREDICTOR
The branch is predicted on the basis of the
previously executed one by selecting the
appropriate 1 bit BHT.
[Figure: two 1-bit branch prediction tables, one used if the last branch was taken and one used if the last branch was not taken. The effective result of the previous branch selects which table supplies the prediction for the branch to be predicted.]
(M,N) PREDICTORS
In general, an (m,n) predictor records
the outcomes of the last m branches and uses them to select
among 2^m history tables, each with n-bit entries.
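As a rough illustration of this definition, the sketch below keeps an m-bit global history and 2^m tables of n-bit saturating counters. The class name, the table size, the indexing scheme, and the use of saturating counters are assumptions, not details taken from the slides.

```python
class CorrelatingPredictor:
    """(m,n) predictor sketch: m bits of global history select one of
    2**m tables of n-bit saturating counters."""

    def __init__(self, m=2, n=2, entries=1024):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                      # newest outcome in bit 0
        self.max_count = (1 << n) - 1
        self.tables = [[0] * entries for _ in range(1 << m)]

    def _index(self, pc):
        # Assumed indexing: low-order bits of the word-aligned branch address.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        counter = self.tables[self.history][self._index(pc)]
        return counter >= (1 << (self.n - 1))  # upper half predicts taken

    def update(self, pc, taken):
        table = self.tables[self.history]
        i = self._index(pc)
        if taken:
            table[i] = min(self.max_count, table[i] + 1)
        else:
            table[i] = max(0, table[i] - 1)
        mask = (1 << self.m) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 10   # taken, not taken, taken, ...
plain_misses = mispredictions(CorrelatingPredictor(m=0, n=2), 0x40, alternating)
corr_misses = mispredictions(CorrelatingPredictor(m=2, n=2), 0x40, alternating)
print(plain_misses, corr_misses)
```

With m = 0 the structure degenerates to a plain 2-bit table, which keeps mispredicting the alternating branch, while the (2,2) configuration learns the pattern after a short warm-up.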
EXAMPLE OF A (2,2)
CORRELATING BRANCH
PREDICTOR
ACCURACY OF DIFFERENT
SCHEMES
[Figure: bar chart of the frequency of mispredictions (0% to 18%) on doduc, gcc, nasa7, eqntott, espresso, spice, fpppp, tomcatv, li, and matrix300 for three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2).]
ADDRESS MUST ALSO
BE PREDICTED
The Branch Target Buffer is accessed in the IF stage.
Typical entry:
BRANCH TARGET
BUFFER STRUCTURE
[Figure: the PC of the fetched instruction is used for an associative lookup; on a match, the predicted PC should be used as the next PC.]
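A minimal sketch of the lookup, assuming a dictionary stands in for the associative memory and that only taken branches are kept in the buffer (both are assumptions; the class and method names are hypothetical):

```python
class BranchTargetBuffer:
    """Maps the PC of a branch to its predicted target PC."""

    def __init__(self):
        self.entries = {}   # branch PC -> predicted target PC

    def next_pc(self, pc, instruction_size=4):
        # Hit: use the stored target as the next PC.
        # Miss: fall through to the sequential instruction.
        return self.entries.get(pc, pc + instruction_size)

    def update(self, pc, taken, target):
        # Assumed policy: keep an entry only for branches that were taken.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
first = btb.next_pc(0x100)            # miss: sequential fetch
btb.update(0x100, taken=True, target=0x200)
second = btb.next_pc(0x100)           # hit: predicted target
print(hex(first), hex(second))
```

Because the lookup uses only the PC, it can happen in the IF stage, before the instruction has even been decoded as a branch.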
BRANCH TARGET BUFFER
SCOREBOARD SCHEME
IMPLEMENTS THE ILP
LECTURE 3 CONTD..
EXAMPLE:
1. DIVD F0,F2,F4
2. ADDD F10,F0,F8
3. SUBD F12,F8,F14
DEPENDENCE
1. Data Dependence: True dependence
2. Name dependence: WAR or WAW
3. Control Dependence
• Data dependence: instruction j is data dependent on instruction i if
• Instruction i produces a result that may be used by
instruction j, or
• Instruction j is data dependent on instruction k and k is
data dependent on instruction i (transitive dependence)
DEPENDENCE
Data Dependence
Name Dependence
NAME DEPENDENCE
• Name dependence:
• A name dependence occurs when two or more
instructions use the same register or memory
location, called a name, but there is no flow of
data between the instructions associated with that
name.
• Two types of name dependence
• Antidependence: Write After Read (WAR)
• Output dependence: Write After Write (WAW)
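The three register dependence types can be checked mechanically for a pair of instructions. The sketch below is a simplification under stated assumptions: it looks only at register names, compares one earlier and one later instruction, and ignores intervening writes; the `Instr` tuple and function name are hypothetical.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dest src1 src2")


def dependences(earlier, later):
    """List the dependences of `later` on `earlier` (register names only)."""
    found = []
    if earlier.dest in (later.src1, later.src2):
        found.append("RAW")   # true data dependence
    if later.dest in (earlier.src1, earlier.src2):
        found.append("WAR")   # antidependence (name dependence)
    if later.dest == earlier.dest:
        found.append("WAW")   # output dependence (name dependence)
    return found


divd = Instr("DIVD", "F0", "F2", "F4")
addd = Instr("ADDD", "F10", "F0", "F8")
subd = Instr("SUBD", "F12", "F8", "F14")

print(dependences(divd, addd))   # ADDD reads F0, which DIVD writes
print(dependences(addd, subd))   # SUBD only shares the source F8 with ADDD
```

Applied to the three-instruction example from the earlier slide, only the DIVD/ADDD pair is dependent, so SUBD is free to execute before ADDD.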
SCOREBOARD SCHEME 4 STAGES
SCOREBOARD IMPLICATIONS
• Out-of-order completion introduces WAR and
WAW hazards, which are resolved by:
1. Queueing both the operation and copies of its
operands.
2. Reading registers only during the Read Operands
stage.
SCOREBOARD IMPLICATIONS
1. For WAW, the machine stalls the later
instruction at issue until the previous instruction
completes
2. Multiple execution units
3. The scoreboard keeps track of
dependencies and the state of
operations.
FOUR STAGES OF
SCOREBOARD CONTROL
1. Issue
• Decode instructions and check for structural hazards.
• If a functional unit for the instruction is free and no
other active instruction has the same destination
register (WAW), the scoreboard issues the instruction
to the functional unit and updates its internal data
structure.
• If a structural or a WAW hazard exists, then the
instruction issue stalls, and no further instructions will
issue until these hazards are cleared.
FOUR STAGES OF
SCOREBOARD CONTROL
2. Read Operands
a) Wait until there are no data hazards, then read the operands
b) A source operand is available if:
- no earlier issued active instruction will write it, or
- a functional unit is currently writing its value into a register
c) When the source operands are available, the
scoreboard tells the functional unit to proceed to
read the operands from the registers and begin
execution.
d) RAW hazards are resolved dynamically in this step,
and instructions may be sent into execution out of
order.
FOUR STAGES OF
SCOREBOARD CONTROL
3. Execution Stage
• Operate on operands
• The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the
scoreboard that it has completed execution.
• FUs are characterized by:
- Latency (the effective time used to complete one
operation).
- Initiation interval (the number of cycles that must
elapse between issuing two operations to the same
functional unit).
FOUR STAGES OF
SCOREBOARD CONTROL
4. Write result
• Finish execution
• Once the scoreboard is aware that the
unit has completed execution, the
scoreboard checks for WAR hazards.
• If there is none, it writes the result.
• If there is a WAR hazard, the write of the result is stalled.
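The stall conditions of the Issue and Write result stages can be sketched as small predicates. This is a simplified sketch under assumptions: functional units are plain dictionaries, and the ready flags Rj/Rk are taken to mean "ready and not yet read" (cleared once the operands have been read), following the usual textbook convention. All function names are hypothetical.

```python
def can_issue(unit_busy, dest, register_result):
    """Issue: the unit must be free (no structural hazard) and no active
    instruction may already be going to write dest (no WAW hazard)."""
    return not unit_busy and register_result.get(dest) is None


def can_read_operands(rj, rk):
    """Read operands: both sources must be available (no RAW hazard)."""
    return rj and rk


def can_write_result(dest, other_units):
    """Write result: stall while another active unit still has to read
    the old value of dest (WAR hazard)."""
    for unit in other_units:
        if unit["Fj"] == dest and unit["Rj"]:
            return False
        if unit["Fk"] == dest and unit["Rk"]:
            return False
    return True


# Situation of the example at cycle 17: ADDD wants to write F6, but DIVD
# (waiting for F0 from Mult1) has not yet read its operand F6.
divd = {"Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True}
print(can_write_result("F6", [divd]))   # ADDD must stall
```

Once DIVD has read its operands and its flags are cleared, the same check lets ADDD write F6.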
WAR EXAMPLE
Original code        With register renaming
DIVD F0,F2,F4        DIVD F0,F2,F4
ADDD F10,F0,F8       ADDD F10,F0,F9
SUBD F8,F8,F14       SUBD F8,F9,F14
In the original code, the scoreboard would stall the SUBD
in the WB stage until ADDD has read F0 and F8.
SCOREBOARD STRUCTURE
1. Instruction status
2. Functional Unit status
Indicates the state of the functional unit (FU):
a) Busy – Indicates whether the unit is busy or not
b) Op - The operation to perform in the unit (+,-, etc.)
c) Fi - Destination register
d) Fj, Fk – Two Source Register Numbers
e) Qj, Qk – Functional units producing source registers
f) Rj, Rk – Flags indicating when Fj, Fk are ready
3. Register result status.
• Indicates which functional unit will write each register.
Blank if no pending instructions will write that register.
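One possible way to hold the functional unit status and the register result status is sketched below. The dataclass, its lowercase field names, and the `issue` helper are illustrative assumptions; the helper assumes the issue conditions (free unit, no WAW) have already been checked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionalUnit:
    name: str
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    qj: Optional[str] = None   # unit producing fj, if any
    qk: Optional[str] = None   # unit producing fk, if any
    rj: bool = False           # fj ready?
    rk: bool = False           # fk ready?


def issue(unit, op, dest, src1, src2, register_result):
    """Fill in the unit's status at issue and record it as the writer of dest."""
    unit.busy, unit.op = True, op
    unit.fi, unit.fj, unit.fk = dest, src1, src2
    unit.qj = register_result.get(src1)
    unit.qk = register_result.get(src2)
    unit.rj = unit.qj is None
    unit.rk = unit.qk is None
    register_result[dest] = unit.name


# The Integer unit is still loading F2 when MULTD F0,F2,F4 is issued.
register_result = {"F2": "Integer"}
mult1 = FunctionalUnit("Mult1")
issue(mult1, "Mult", "F0", "F2", "F4", register_result)
print(mult1)
```

The resulting entry has Qj = Integer, Rj = No, Rk = Yes, and the register result status now names Mult1 as the writer of F0.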
ASSUMPTIONS FOR
EXAMPLE
1. Load/store unit:
a) Address calculation units: 1
b) Latency: 1 cycle
2. ALU/Integer unit: execution latency 2 cycles
3. Floating-point MultD: execution latency 10 cycles
4. Floating-point DivD: execution latency 40 cycles
SCOREBOARD EXAMPLE
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
SCOREBOARD
EXAMPLE CYCLE 1
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
1      FU                      Integer
SCOREBOARD
EXAMPLE CYCLE 2
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
2      FU                      Integer
SCOREBOARD
EXAMPLE CYCLE 3
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
3      FU                      Integer
• Issue stalls
SCOREBOARD
EXAMPLE CYCLE 4
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
4      FU                      Integer
• Issue stalls due to single LD/ST unit and single address
calculation unit
SCOREBOARD
EXAMPLE CYCLE 5
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
5      FU         Integer
• In this cycle the 2nd load is issued.
SCOREBOARD
EXAMPLE CYCLE 6
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6
MULTD F0      F2   F4   6
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
6      FU  Mult1  Integer
• MULTD is issued but has to wait for F2
SCOREBOARD
EXAMPLE CYCLE 7
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7
MULTD F0      F2   F4   6
SUBD  F8      F6   F2   7
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
      Divide   No

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
7      FU  Mult1  Integer               Add
• Now SUBD can be issued, but it has to wait for operand F2 before it can read its operands.
SCOREBOARD
EXAMPLE CYCLE 8A
• DIVD is issued but there is another RAW hazard for F0
SCOREBOARD
EXAMPLE CYCLE 8B
• Load completes, and operands for MULTD and SUBD are ready
SCOREBOARD
EXAMPLE CYCLE 9
execution in parallel
SCOREBOARD
EXAMPLE CYCLE 11
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
8     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
11     FU  Mult1                        Add  Divide
• SUBD finishes execution
SCOREBOARD
EXAMPLE CYCLE 12
• Read operands for DIVD: cannot read F0 before MULTD
writes F0
SCOREBOARD
EXAMPLE CYCLE 13
• SUBD writes results and ADDD can be issued
SCOREBOARD
EXAMPLE CYCLE 14
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
5     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
2     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
14     FU  Mult1               Add           Divide
• ADDD can read operands
SCOREBOARD
EXAMPLE CYCLE 15
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
4     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
1     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
15     FU  Mult1               Add           Divide
• ADDD can execute on its operands
SCOREBOARD
EXAMPLE CYCLE 16
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14             16

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
3     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
16     FU  Mult1               Add           Divide
• ADDD finishes execution
SCOREBOARD
EXAMPLE CYCLE 17
• Write result of ADDD? NO, there is a WAR hazard for F6
SCOREBOARD
EXAMPLE CYCLE 18
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14             16

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
1     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
18     FU  Mult1               Add           Divide
• The stall continues; nothing to do
SCOREBOARD
EXAMPLE CYCLE 19
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14             16

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
0     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
19     FU  Mult1               Add           Divide
• MULTD completes execution
SCOREBOARD
EXAMPLE CYCLE 20
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14             16

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
20     FU                      Add           Divide
• MULTD writes back its result
SCOREBOARD
EXAMPLE CYCLE 21
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21
ADDD  F6      F8   F2   13     14             16

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
21     FU                      Add           Divide
• DIVD reads operand F0
SCOREBOARD
EXAMPLE CYCLE 22
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21
ADDD  F6      F8   F2   13     14             16                  22

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
40    Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
22     FU                                    Divide
• Now DIVD can execute on its operands, and ADDD can write
the result
SCOREBOARD
EXAMPLE CYCLE 61
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21             61
ADDD  F6      F8   F2   13     14             16                  22

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
61     FU                                    Divide
• DIVD finishes execution after 40 cycles
SCOREBOARD
EXAMPLE CYCLE 62
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21             61                  62
ADDD  F6      F8   F2   13     14             16                  22

Functional unit status
                           dest S1   S2   FU for j FU for k Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No

Register result status
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
62     FU
• DIVD finishes after writing back its result
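The final instruction status table can be sanity-checked against the latency assumptions of the example. In the sketch below the cycle numbers are copied from the cycle-62 table; treating SUBD and ADDD as 2-cycle operations is an assumption that matches the countdown shown for the Add unit, and the function name is hypothetical.

```python
# Execution latencies in cycles (SUBD/ADDD = 2 is an assumption, see above).
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}

# (operation, issue, read operands, execution complete, write result)
TIMELINE = [
    ("LD", 1, 2, 3, 4),
    ("LD", 5, 6, 7, 8),
    ("MULTD", 6, 9, 19, 20),
    ("SUBD", 7, 9, 11, 12),
    ("DIVD", 8, 21, 61, 62),
    ("ADDD", 13, 14, 16, 22),
]


def check_timeline(timeline, latency):
    """Return a list of inconsistencies found in the timeline."""
    problems = []
    previous_issue = 0
    for op, issue, read, done, write in timeline:
        if issue <= previous_issue:
            problems.append(f"{op}: issued out of order")
        if not issue < read < done < write:
            problems.append(f"{op}: stages out of sequence")
        if done - read != latency[op]:
            problems.append(f"{op}: execution took {done - read} cycles")
        previous_issue = issue
    return problems


problems = check_timeline(TIMELINE, LATENCY)
print(problems)
```

The check shows in-order issue with out-of-order completion: ADDD is issued last but writes its result at cycle 22, long before DIVD writes at cycle 62.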
Lecture 3.3
Tomasulo’s Algorithm
Dynamic Scheduling Using Tomasulo’s
Algorithm
• It was used in the IBM 360/91 floating-point unit
• Invented by Robert Tomasulo of IBM
• It tracks when operands become available
• It minimizes RAW hazards
• It introduces register renaming to minimize WAW
and WAR hazards
ISSUE
Get an instruction I from the queue. If it is an FP operation,
check whether a reservation station (RS) is empty (i.e., check for structural
hazards).
Rename registers:
WAR resolution: if I writes Rx and an earlier-issued
instruction J reads Rx, then J
already knows the value of Rx or knows which
instruction will write it. So the register file (RF) entry can be linked to I.
WAW resolution: since we use in-order issue, the
RF entry can be linked to I.
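A minimal sketch of this issue step is shown below, under several assumptions: a single shared pool of reservation stations, the conventional field names Vj/Vk (operand values) and Qj/Qk (producing station), and a hypothetical `issue` function. Execution and result broadcast are not modelled.

```python
def issue(instr, reg_status, reg_values, stations):
    """Issue instr = (op, dest, src1, src2) to a free reservation station.
    Returns the station name, or None on a structural hazard (stall)."""
    free = next((name for name, rs in stations.items() if rs is None), None)
    if free is None:
        return None
    op, dest, src1, src2 = instr
    entry = {"op": op}
    for slot, src in (("j", src1), ("k", src2)):
        producer = reg_status.get(src)
        if producer is None:
            entry["V" + slot] = reg_values[src]   # value ready: copy it now
            entry["Q" + slot] = None
        else:
            entry["V" + slot] = None
            entry["Q" + slot] = producer          # wait for that station
    stations[free] = entry
    reg_status[dest] = free   # renaming: dest now refers to this station
    return free


reg_values = {"F2": 2.0, "F4": 4.0, "F8": 8.0, "F14": 14.0}
reg_status = {}
stations = {"RS1": None, "RS2": None, "RS3": None}

issue(("DIVD", "F0", "F2", "F4"), reg_status, reg_values, stations)
issue(("ADDD", "F10", "F0", "F8"), reg_status, reg_values, stations)
issue(("SUBD", "F8", "F8", "F14"), reg_status, reg_values, stations)
print(stations["RS2"], reg_status)
```

ADDD holds its own copy of F8 in Vk and a tag for F0 in Qj, so the later SUBD may overwrite F8 at any time without a WAR stall.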
Finish-up
B[I] = X ....
SW 0(R2), . . .
J join
Else: ....
X
C[I] = Join: ....
SW 0(R3), . . .
CSE-4821 Compiler Techniques for ILP, by Dr. M. A. Rouf
Summary of Compiler
Techniques
• Try to avoid dependence stalls
• Loop unrolling
– Reduce loop overhead
• Software pipelining
– Reduce single body dependence stalls
• Trace scheduling
– Reduce impact of other branches
• Compilers use a mix of all three
• All techniques depend on prediction
accuracy
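As a small illustration of the first technique, the sketch below unrolls a summation loop by a factor of four. It only shows the source-level transformation; the unroll factor and function names are arbitrary choices, and a compiler would apply the same idea to machine code.

```python
def sum_rolled(x):
    total = 0.0
    for i in range(len(x)):       # one loop test and increment per element
        total += x[i]
    return total


def sum_unrolled4(x):
    n = len(x)
    main = n - n % 4
    total = 0.0
    for i in range(0, main, 4):   # one loop test and increment per 4 elements
        total += x[i]
        total += x[i + 1]
        total += x[i + 2]
        total += x[i + 3]
    for i in range(main, n):      # clean-up loop for the leftover elements
        total += x[i]
    return total


data = [float(i) for i in range(10)]
print(sum_rolled(data), sum_unrolled4(data))
```

The unrolled body also exposes four independent loads per iteration, which gives a scheduler more instructions to reorder around dependence stalls.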
CSE 4821 Advanced Computer Architecture
Dr. M. A. Rouf
Dept. of CSE
DUET, Gazipur
Performance beyond single thread ILP
• Some applications have much higher natural
parallelism
(e.g., database or scientific codes)
• Explicit Thread-Level Parallelism or Data-Level
Parallelism
• Thread: a process with its own instructions and data
• A thread may be a process that is part of a parallel program of multiple
processes, or it may be an independent program
• Each thread has all the state (instructions, data, PC, register state, and
so on) necessary to allow it to execute
• Data-Level Parallelism: perform identical operations
on data, and lots of data
From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.
Do both ILP and TLP?
[Figure: issue-slot diagrams by cycle for five configurations, with slot utilization 16/48 = 33.3%, 27/48 = 56.3%, 27/48 = 56.3%, 29/48 = 60.4%, and 42/48 = 87.5%. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]
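The utilization figures are the number of filled issue slots divided by the 48 available slots. The short sketch below recomputes them; it uses decimal arithmetic with half-up rounding so that 56.25% is shown as 56.3%, as on the slide. The function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

TOTAL_SLOTS = 48
USED_SLOTS = [16, 27, 27, 29, 42]


def utilization_percent(used, total):
    """Percentage of issue slots that are filled, rounded to one decimal."""
    percent = Decimal(100) * Decimal(used) / Decimal(total)
    return percent.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


percentages = [str(utilization_percent(u, TOTAL_SLOTS)) for u in USED_SLOTS]
print(percentages)
```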
Design Challenges in SMT
• SMT makes sense only with a fine-grained implementation, so what is the
impact of fine-grained scheduling on single-thread performance?
• Does a preferred-thread approach sacrifice neither throughput nor single-thread
performance?
• Unfortunately, with a preferred thread, the processor is likely to sacrifice some
throughput when the preferred thread stalls
• A larger register file is needed to hold multiple contexts
• Clock cycle time must not be affected, especially in
• Instruction issue: more candidate instructions need to be considered
• Instruction completion: choosing which instructions to commit may be
challenging
• Cache and TLB conflicts generated by SMT must not
degrade performance