
1 Rouf


Case Study: Intel Processors

Courtesy: Intel Corp.

By: Dr. M. A. Rouf


Professor, Dept. of CSE, DUET, Gazipur
Course Introduction
• Course Teacher:
– Professor Dr. Mohammad Abdur Rouf
– Dr. Mohammad Jakirul Islam
• Course Website:
– https://sites.google.com/a/duet.ac.bd/marouf-cse/courses-2018/cse-4821-advanced-computer-archtecture

CMOS VLSI Design


Case Study: Intel Processors Slide 2
Course Introduction
• Email:
– Dr. M. A. Rouf
• rouf7606@gmail.com, marouf.cse@duet.ac.bd
– Dr. Jakirul Islam
• jakirduet@gmail.com
• The Zoom classroom meeting link will be announced via the CR
or another suitable forum

Course Introduction
• Attendance
– Attendance will be taken during class time
• Class material and ppt slides will be uploaded earlier
• It is advisable to download and print the slide before
class time
• If the class meeting is disturbed due to power
disruption or network failure it will be solved after
discussion

Outline
• Evolution of Intel Microprocessors
– Scaling from 4004 to Pentium 4
– Courtesy of Intel Museum

4004
• First microprocessor (1971)
– For Busicom calculator of Nippon
Calculator
• Characteristics
– 10 µm process
– 2300 transistors
– 400 – 800 kHz
– 4-bit word size
– 16-pin DIP package
• Intel 4004 was a part of MCS-4 chipset,
which included the following chips:
– 4001 - 256-bit mask ROM and 4-bit I/O
device,
– 4002 - 320-bit RAM and 4-bit I/O device,
– 4003 - 10-bit shift register,
– 4008 and 4009 - standard memory and
I/O interface set.

8008
• 8-bit follow-on (1972)
– Dumb terminals
• Characteristics
– 10 µm process
– 3500 transistors
– 500 – 800 kHz
– 8-bit word size
– 18-pin DIP package
– 16 KB Physical memory

8080
• 16-bit address bus (1974)
– Used in Altair computer
• (early hobbyist PC)
• Characteristics
– 6 µm process
– 4500 transistors
– 2 MHz
– 8-bit word size
– 40-pin DIP package

8086 / 8088
• 16-bit processor (1978-9)
– IBM PC and PC XT
– Revolutionary products
– Introduced x86 ISA
• Characteristics
– 3 µm process
– 29k transistors
– 5-10 MHz
– 16-bit word size
– 40-pin DIP package
• Microcode ROM

80286
• Virtual memory (1982)
– IBM PC AT
• Characteristics
– 1.5 µm process
– 134k transistors
– 6-12 MHz
– 16-bit word size
– 68-pin PGA
• Regular datapaths and ROMs
– Bitslices clearly visible

80386
• 32-bit processor (1985)
– Modern x86 ISA
• Characteristics
– 1.5-1 µm process
– 275k transistors
– 16-33 MHz
– 32-bit word size
– 100-pin PGA
• 32-bit datapath,
microcode ROM,
synthesized control

80486
• Pipelining (1989)
– Floating point unit
– 8 KB cache
• Characteristics
– 1-0.6 µm process
– 1.2M transistors
– 25-100 MHz
– 32-bit word size
– 168-pin PGA (Pin Grid Array)
• Cache, Integer datapath,
FPU, microcode,
synthesized control

Pentium
• Superscalar (1993)
– 2 instructions per cycle
– Separate 8KB I$ & D$
• Characteristics
– 0.8-0.35 µm process
– 3.2M transistors
– 60-300 MHz
– 32-bit word size
– 296-pin PGA
• Caches, datapath,
FPU, control

Pentium Pro / II / III
• Dynamic execution (1995-9)
– 3 micro-ops / cycle
– Out of order execution
– 16-32 KB I$ & D$
– Multimedia instructions
– PIII adds 256+ KB L2$
• Characteristics
– 0.6-0.18 µm process
– 5.5M-28M transistors
– 166-1000 MHz
– 32-bit word size
– Multi-chip Module (MCM)
– Single Edge Contact Cartridge
(SECC)

Pentium 4
• Deep pipeline (2001)
– 20 stage pipeline
– Very fast clock
– 256-1024 KB L2$
• Characteristics
– 180 – 90 nm process
– 42-125M transistors
– 1.4-3.4 GHz
– 32-bit word size
– 478-pin PGA
• Units start to become
invisible on this scale

Core i3
• Processor cores: 2
– 45nm process
– 64-bit word size
– Power optimized front side bus
– Radix-16 technology divider adds:
• Divider and square root in same chip.
– Deeper buffers
– 14 stage efficient pipeline
– Micro and Macro Ops Fusion
– Additional ALU
– Advanced Branch Prediction
Comparison of Different cores
Features          Core i3       Core i5       Core i7
Cores             2             4             4
Hyper-threading   Yes           No            Yes
Turbo Boost       No            Yes           Yes
K-Model           No            Yes           Yes
Cache             2-4 MB        4-6 MB        8 MB
Clock             3.4-4.2 GHz   2.4-3.8 GHz   2.9-4.2 GHz

Summary
• 10^4 increase in transistor count and clock
frequency over 30 years!

LECTURE1:
FUNDAMENTAL OF
COMPUTER DESIGN
DR. M. A. ROUF PH.D.
(KAIST)
DHAKA UNIVERSITY OF ENGINEERING AND
TECHNOLOGY (DUET)
CSE-4821: ADVANCED COMPUTER
ARCHITECTURE
SINGLE PROCESSOR PERFORMANCE

CSE-4821 Dr. M. A. Rouf, Dept. of CSE, DUET, Gazipur
CLASSES OF
COMPUTERS
Personal Mobile Device (PMD)
• e.g. smart phones, tablet computers
• Emphasis on energy efficiency and real-time for media apps
Desktop Computing
• Emphasis on price-performance

CLASSES OF
COMPUTERS (CONTD..)
Servers

Emphasis on availability, scalability, throughput

Clusters / Warehouse Scale Computers

Used for “Software as a Service (SaaS)”

Emphasis on availability and price-performance

Sub-class: Supercomputers, emphasis: floating-point
performance and fast internal networks
Embedded Computers
• Microwaves, washing machines, printers, networking switches
• Emphasis: price

CURRENT TRENDS
Cannot continue to exploit Instruction-Level parallelism (ILP)
• Single processor performance improvement ended in 2003

New models for performance:


• Data-level parallelism (DLP)
• Thread-level parallelism (TLP)
• Request-level parallelism (RLP)
• These require explicit restructuring of applications

PARALLELISM
Classes of parallelism in applications:
• Data-Level Parallelism (DLP)
• Task-Level Parallelism (TLP)
Classes of architectural parallelism:
• Instruction-Level Parallelism (ILP)
• Exploit DLP
• Vector architectures/Graphic Processor Units (GPUs)
• Exploit DLP
• Thread-Level Parallelism
• Exploit DLP or TLP
• Request-Level Parallelism
• Exploit TLP

LAYER OF SYSTEM ARCHITECTURE

DEFINING COMPUTER
ARCHITECTURE
The task of the computer designer:
Determine what attributes are important for a new
computer, then design a computer to maximize performance
while staying within cost, power, and availability constraints

DEFINING COMPUTER
ARCHITECTURE
This task has many aspects:
• Instruction set design
• Functional organization
• Logic design
• And implementation
Also,
• Integrated circuit design
• Packaging
• Power
• Cooling
AND
• Optimization, including a lot of technologies (compiler, OS…)

INSTRUCTION SET
ARCHITECTURE (ISA)
The instruction set architecture
serves as the boundary between
the software and hardware.

TRENDS IN
TECHNOLOGY
To evaluate a computer, the designer must
be aware of rapid changes in
implementation technology
• Integrated circuit logic:
• Transistor density increases by about 35% per year
• Die size increases by 10% to 20% per year
• The combined effect is a growth rate in transistor
count per chip of about 40% to 55% per year

TRENDS IN
TECHNOLOGY
• DRAM (dynamic random-access memory):
• Capacity increases by about 40% per year, doubling
roughly every two years

• Magnetic disk technology


• Before 1990: 30% per year, doubling in 3 years
• 1996~2004: from 60% to 100% increase per year
• After 2004: drop back to 30% per year
• Despite this roller coaster of rates of improvement, disk is
still 50-100 times cheaper per bit than DRAM
• Flash Memory
• LAN
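
The growth rates quoted above can be turned into doubling times with the compound-growth formula t = ln 2 / ln(1 + r). The following is a minimal sketch; the function name `doubling_time_years` is an illustrative choice and not part of the course material.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Rates taken from the slide: DRAM about 40% per year, disk about 30% per year.
dram = doubling_time_years(0.40)
disk = doubling_time_years(0.30)
print(f"40% per year -> doubles in {dram:.2f} years")
print(f"30% per year -> doubles in {disk:.2f} years")
```

At 40% per year the doubling time comes out at roughly two years, which matches the slide. At 30% per year it is about 2.6 years, which the slide rounds to three.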

CSE 4821
Advanced Computer Architecture

LECTURE 3

INSTRUCTION SET PRINCIPLES, PIPELINE HAZARDS

PROF. DR. M. A. ROUF


DEPT. OF CSE, DUET
INSTRUCTION SET DESIGN ISSUES

Instruction set design issues include:


• Where are operands stored?
• registers, memory, stack, accumulator
• How many explicit operands are there?
• 0, 1, 2, or 3
• How is the operand location specified?
• register, immediate, indirect, . . .
• What type & size of operands are supported?
• byte, int, float, double, string, vector. . .
• What operations are supported?
• add, sub, mul, move, compare . . .

EVOLUTION OF INSTRUCTION SETS
Single Accumulator (EDSAC 1950, Maurice Wilkes)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model from Implementation
• High-level Language Based (B5000 1963)
• Concept of a Family (IBM 360 1964)
General Purpose Register Machines
• Complex Instruction Sets (Vax, Intel 432 1977-80)
• Load/Store Architecture (CDC 6600, Cray 1 1963-76)
CISC: Intel x86, Pentium
RISC: MIPS, Sparc, HP-PA, IBM RS6000, PowerPC . . . 1987

CLASSIFYING ISAS
Accumulator (before 1960, e.g. 68HC11):
  1-address   add A              acc ← acc + mem[A]

Stack (1960s to 1970s):
  0-address   add                tos ← tos + next

Memory-Memory (1970s to 1980s):
  2-address   add A, B           mem[A] ← mem[A] + mem[B]
  3-address   add A, B, C        mem[A] ← mem[B] + mem[C]

Register-Memory (1970s to present, e.g. 80x86):
  2-address   add R1, A          R1 ← R1 + mem[A]
              load R1, A         R1 ← mem[A]

Register-Register (Load/Store) (1960s to present, e.g. MIPS):
  3-address   add R1, R2, R3     R1 ← R2 + R3
              load R1, R2        R1 ← mem[R2]
              store R1, R2       mem[R1] ← R2

OPERAND LOCATIONS IN FOUR ISA CLASSES
GPR

WORD-ORIENTED MEMORY
ORGANIZATION
Memory is byte addressed and provides access for bytes (8 bits),
half words (16 bits), words (32 bits), and double words (64 bits).

Addresses Specify Byte Locations
• Address of first byte in word
• Addresses of successive words differ by 4 (32-bit) or 8 (64-bit)

[Figure: byte addresses 0000-0015 grouped into 32-bit words (Addr = 0000, 0004, 0008, 0012) and 64-bit words (Addr = 0000, 0008)]

BYTE ORDERING
How should bytes within multi-byte word be ordered in memory?
Conventions
• Sun’s, Mac’s are “Big Endian” machines
• Least significant byte has highest address
• Alphas, PC’s are “Little Endian” machines
• Least significant byte has lowest address

BYTE ORDERING EXAMPLE
Big Endian
• Least significant byte has highest address
Little Endian
• Least significant byte has lowest address
Example
• Variable x has 4-byte representation 0x01234567
• Address given by &x is 0x100

Address          0x100   0x101   0x102   0x103
Big Endian       01      23      45      67
Little Endian    67      45      23      01
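
The byte layout of the example value can be checked with a short sketch. It uses Python's built-in `int.to_bytes`; the helper `layout` and the base address argument are illustrative names, with 0x100 taken from the slide.

```python
x = 0x01234567

# Serialize the same 4-byte value under both conventions.
big = x.to_bytes(4, byteorder="big")
little = x.to_bytes(4, byteorder="little")


def layout(data: bytes, base: int = 0x100):
    """Pair each byte with the address it would occupy, starting at `base`."""
    return [(hex(base + i), f"{b:02x}") for i, b in enumerate(data)]


print("Big Endian   ", layout(big))
print("Little Endian", layout(little))
```

In the big-endian layout the most significant byte (01) sits at the lowest address, and in the little-endian layout the least significant byte (67) does.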

TYPES OF OPERATIONS
Arithmetic and Logic: AND, ADD
Data Transfer: MOVE, LOAD, STORE
Control: BRANCH, JUMP, CALL
System: OS CALL, VM
Floating Point: ADDF, MULF, DIVF
Decimal: ADDD, CONVERT
String: MOVE, COMPARE
Graphics: (DE)COMPRESS

TOP 10 80X86 INSTRUCTIONS
Rank   Instruction              Integer average (percent of total executed)
1      load                     22%
2      conditional branch       20%
3      compare                  16%
4      store                    12%
5      add                      8%
6      and                      6%
7      sub                      5%
8      move register-register   4%
9      call                     1%
10     return                   1%
       Total                    96%
• Simple instructions dominate instruction frequency

RELATIVE FREQUENCY OF
CONTROL INSTRUCTIONS

Operation SPECint92 SPECfp92


Call/Return 13% 11%
Jumps 6% 4%
Branches 81% 87%

• Design hardware to handle branches quickly,
since these occur most frequently

THE MIPS INSTRUCTION FORMATS
All MIPS instructions are 32 bits long. The three instruction formats:
• R-type
  bits:    31-26    25-21    20-16    15-11    10-6     5-0
  field:   op       rs       rt       rd       shamt    funct
  width:   6 bits   5 bits   5 bits   5 bits   5 bits   6 bits
• I-type
  bits:    31-26    25-21    20-16    15-0
  field:   op       rs       rt       immediate
  width:   6 bits   5 bits   5 bits   16 bits
• J-type
  bits:    31-26    25-0
  field:   op       target address
  width:   6 bits   26 bits
The different fields are:
• op: operation of the instruction
• rs, rt, rd: the source and destination register specifiers
• shamt: shift amount
• funct: selects the variant of the operation in the “op” field
• address / immediate: address offset or immediate value
• target address: target address of the jump instruction
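
As a rough illustration of the R-type layout, the sketch below packs and unpacks the six fields with shifts and masks. The function names are illustrative. The sample register numbers ($t0 = 8, $s1 = 17, $s2 = 18) and the funct code 0x20 for `add` come from the standard MIPS32 encoding, not from these slides.

```python
def encode_r_type(op: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack R-type fields into a 32-bit word: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    for value, width in ((op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def decode_r_type(word: int) -> dict:
    """Split a 32-bit word back into its R-type fields."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


# add $t0, $s1, $s2  (rd = 8, rs = 17, rt = 18)
word = encode_r_type(op=0, rs=17, rt=18, rd=8, shamt=0, funct=0x20)
print(f"{word:#010x}")
print(decode_r_type(word))
```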

MIPS ADDRESSING MODES/INSTRUCTION FORMATS

• All instructions 32 bits wide

Register (direct)   op rs rt rd      operand in register
Immediate           op rs rt immed   operand is the immed field
Displacement        op rs rt immed   operand in Memory at register + immed
PC-relative         op rs rt immed   target in Memory at PC + immed

REVIEW: 5-STAGE
EXECUTION
5 canonical stage “RISC” load-store architecture
1. Instruction fetch (IF):
• get instruction from memory/cache
2. Instruction decode, Register read (ID):
• translate opcode into control signals and read regs
3. Execute (EX):
• perform ALU operation, load/store address, branch outcomes
4. Memory (MEM):
• access memory if load/store, everyone else idle
5. Writeback/retire (WB):
• write results to register file

SOLUTION
Overlap execution of instructions
• Start an instruction on every cycle, e.g. the new instruction can be fetched while the
previous one is decoded – pipeline. Each stage performs a specific task in one cycle; the
number of stages is called the pipeline depth (5 here)

Non-pipelined

time 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
     IF  ID  EX  MEM WB  IF  ID  EX  MEM WB  IF  ID  EX  MEM WB

Pipelined

     IF  ID  EX  MEM WB
         IF  ID  EX  MEM WB
             IF  ID  EX  MEM WB
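
The cycle counts in the two timelines can be reproduced with a small sketch, assuming an ideal pipeline with no stalls. The function names are illustrative and do not come from the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def unpipelined_cycles(n_instructions: int, depth: int = len(STAGES)) -> int:
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * depth


def pipelined_cycles(n_instructions: int, depth: int = len(STAGES)) -> int:
    """The first instruction fills the pipe, then one completes per cycle (no stalls)."""
    if n_instructions == 0:
        return 0
    return depth + (n_instructions - 1)


def stage_at(instruction: int, cycle: int):
    """Stage occupied by 0-based `instruction` in 1-based `cycle`, or None."""
    index = cycle - 1 - instruction
    return STAGES[index] if 0 <= index < len(STAGES) else None


n = 3
print(unpipelined_cycles(n), pipelined_cycles(n))
for i in range(n):
    print([stage_at(i, c) or "--" for c in range(1, pipelined_cycles(n) + 1)])
```

For the three instructions shown, the unpipelined machine needs 15 cycles and the ideal pipeline needs 7.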

Pipeline Progress – Instn moves with all control signals, addresses, data items =>
different register lengths at different stages

[Figure: five-stage pipelined datapath showing the PC, instruction memory, register file (R0-R7), ALU, data memory, multiplexers, and the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers]

DATA HAZARD - STALLING
add $s0,$t0,$t1     IF   ID   EX   MEM  WB             $s0 written in WB
STALL                    BUBBLE
STALL                         BUBBLE
sub $t2,$s0,$t3                    IF   ID   EX   MEM  WB   $s0 read in ID

DATA HAZARDS
Two different instructions use the same storage
location
• It must appear as if they executed in sequential order

add R1, R2, R3
sub R2, R4, R1
or R1, R6, R3

• read-after-write (RAW): true dependence (real), e.g. sub reads R1 after add writes it
• write-after-read (WAR): anti dependence (artificial), e.g. sub writes R2 after add reads it
• write-after-write (WAW): output dependence (artificial), e.g. or writes R1 after add writes it

Where (How) do WAR and WAW hazards occur ?

CONTROL HAZARD ON BRANCHES
THREE STAGE STALL

10: beq r1,r3,36    Ifetch Reg    ALU    DMem   Reg
14: and r2,r3,r5           Ifetch Reg    ALU    DMem   Reg
18: or r6,r1,r7                   Ifetch Reg    ALU    DMem   Reg
22: add r8,r1,r9                         Ifetch Reg    ALU    DMem   Reg
36: xor r10,r1,r11                              Ifetch Reg    ALU    DMem   Reg

The penalty when the branch is taken is 3 cycles!

CONTROL HAZARDS
Branch problem:
• branches are resolved in EX stage
=> 3 cycles penalty on taken branches
Ideal CPI = 1. Assuming 3 cycles for all branches and 32% branch
instructions => new CPI = 1 + 0.32*3 = 1.96

Solutions:
• Reduce branch penalty: change the datapath – new adder needed
in ID stage.
• Fill branch delay slot(s) with a useful instruction.
• Fixed branch prediction.
• Static branch prediction.
• Dynamic branch prediction.
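
The CPI figure on this slide follows from adding the average stall cycles per instruction to the ideal CPI. The sketch below reproduces it; the function name is illustrative, and the second call uses a 1-cycle penalty purely as a hypothetical value for a reduced branch penalty.

```python
def effective_cpi(ideal_cpi: float, branch_fraction: float, branch_penalty: float) -> float:
    """CPI once every branch pays `branch_penalty` stall cycles."""
    return ideal_cpi + branch_fraction * branch_penalty


# Numbers from the slide: ideal CPI 1, 32% branches, 3-cycle penalty.
cpi_ex_resolved = effective_cpi(1.0, 0.32, 3)

# Hypothetical: the same workload if the penalty were reduced to 1 cycle.
cpi_reduced = effective_cpi(1.0, 0.32, 1)

print(f"penalty 3: CPI = {cpi_ex_resolved:.2f}")
print(f"penalty 1: CPI = {cpi_reduced:.2f}")
```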

Pipeline: Hazards

Dr. M. A. Rouf
Professor
Dept. of CSE
Pipelining Outline
• Introduction
– Defining Pipelining
– Pipelining Instructions
• Hazards
– Structural hazards \
– Data Hazards
– Control Hazards
• Performance
• Controller implementation
CSE-4821 by Professor Dr. M. A. Rouf 2
Pipeline Hazards
• Where one instruction cannot immediately follow
another
• Types of hazards
– Structural hazards - attempt to use the same resource
by two or more instructions
– Control hazards - attempt to make branching decisions
before branch condition is evaluated
– Data hazards - attempt to use data before it is ready
• Can always resolve hazards by waiting



Structural Hazards
• Attempt to use the same resource by two or more
instructions at the same time
• Example: Single Memory for instructions and data
–Accessed by IF stage
–Accessed at same time by MEM stage
• Solutions
–Delay the second access by one clock cycle, OR
–Provide separate memories for instructions & data
• This is what the book does
• This is called a “Harvard Architecture”
• Real pipelined processors have separate caches



Pipelined Example -
Executing Multiple Instructions
• Consider the following instruction
sequence:
lw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10



Executing Multiple Instructions

Clock Cycle   IF    ID    EX    MEM   WB
1             LW
2             SW    LW
3             ADD   SW    LW
4             SUB   ADD   SW    LW
5                   SUB   ADD   SW    LW
6                         SUB   ADD   SW
7                               SUB   ADD
8                                     SUB


Alternative View - Multicycle Diagram

                      CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7   CC 8
lw $r0, 10($r1)       IM     REG    ALU    DM     REG
sw $r3, 20($r4)              IM     REG    ALU    DM     REG
add $r5, $r6, $r7                   IM     REG    ALU    DM     REG
sub $r8, $r9, $r10                         IM     REG    ALU    DM     REG



Alternative View - Multicycle Diagram

                      CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7   CC 8
lw $r0, 10($r1)       IM     REG    ALU    DM     REG
sw $r3, 20($r4)              IM     REG    ALU    DM     REG
add $r5, $r6, $r7                   IM     REG    ALU    DM     REG
sub $r8, $r9, $r10                         IM     REG    ALU    DM     REG

Memory Conflict: in CC 4, lw accesses memory in DM while sub is fetched in IM



One Memory Port Structural Hazards
Time (clock cycles)
Instr. order   Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
Load           Ifetch   Reg      ALU      DMem     Reg
Instr 1                 Ifetch   Reg      ALU      DMem     Reg
Instr 2                          Ifetch   Reg      ALU      DMem     Reg
Stall                                     Bubble   Bubble   Bubble   Bubble   Bubble
Instr 3                                            Ifetch   Reg      ALU      DMem     Reg



Structural Hazards
Some common Structural Hazards:
• Memory:
– we’ve already mentioned this one.
• Floating point:
– Since many floating point instructions require many cycles, it’s easy for them
to interfere with each other.
• Starting up more of one type of instruction than there are
resources.
– For instance, the PA-8600 can support two ALU + two load/store instructions
per cycle - that’s how much hardware it has available.



Structural Hazards
Dealing with Structural Hazards
Stall
• low cost, simple
• Increases CPI
• use for rare case since stalling has performance effect
Pipeline hardware resource
• useful for multi-cycle resources
• good performance
• sometimes complex e.g., RAM
Replicate resource
• good performance
• increases cost (+ maybe interconnect delay)
• useful for cheap or divisible resources



Structural Hazards

• Structural hazards are reduced with these rules:


– Each instruction uses a resource at most once
– Always use the resource in the same pipeline stage
– Use the resource for one cycle only
• Many RISC ISAs are designed with this in mind
• Sometimes very difficult to do this.
– For example, memory of necessity is used in the IF and
MEM stages.



Pipelining Outline
• Introduction
– Defining Pipelining
– Pipelining Instructions
• Hazards
– Structural hazards
– Data Hazards \
– Control Hazards
• Performance
• Controller implementation
Data Hazards
• Data hazards occur when data is used
before it is ready

Time (in clock cycles); program execution order (in instructions)
                      CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7   CC 8   CC 9
Value of register $2: 10     10     10     10     10/-20 -20    -20    -20    -20
sub $2, $1, $3        IM     Reg    ALU    DM     Reg
and $12, $2, $5              IM     Reg    ALU    DM     Reg
or $13, $6, $2                      IM     Reg    ALU    DM     Reg
add $14, $2, $2                            IM     Reg    ALU    DM     Reg
sw $15, 100($2)                                   IM     Reg    ALU    DM     Reg

The use of the result of the SUB instruction in the next three instructions causes a data
hazard, since the register $2 is not written until after those instructions read it.



Data Hazards
Read After Write (RAW)
Execution order is: InstrI, then InstrJ
InstrJ tries to read operand before InstrI writes it

I: add r1,r2,r3
J: sub r4,r1,r3

• Caused by a “Dependence” (in compiler nomenclature). This hazard results


from an actual need for communication.



Data Hazards
Write After Read (WAR)
Execution order is: InstrI, then InstrJ
InstrJ tries to write operand before InstrI reads it
– Gets wrong operand

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
– Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.

• Can’t happen in MIPS 5 stage pipeline because:


– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5



Data Hazards
Write After Write (WAW)
Execution order is: InstrI, then InstrJ
InstrJ tries to write operand before InstrI writes it
– Leaves wrong result (InstrI not InstrJ)

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”.

• Can’t happen in MIPS 5 stage pipeline because:


– All instructions take 5 stages, and
– Writes are always in stage 5

• Will see WAR and WAW later in more complicated pipes



Data Hazard Detection in MIPS (1)
Read after Write
Time (in clock cycles); program execution order (in instructions)
Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB
                      CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7   CC 8   CC 9
Value of register $2: 10     10     10     10     10/-20 -20    -20    -20    -20
sub $2, $1, $3        IM     Reg    ALU    DM     Reg
and $12, $2, $5              IM     Reg    ALU    DM     Reg
or $13, $6, $2                      IM     Reg    ALU    DM     Reg
add $14, $2, $2                            IM     Reg    ALU    DM     Reg
sw $15, 100($2)                                   IM     Reg    ALU    DM     Reg

EX hazard
1a: EX/MEM.RegisterRd = ID/EX.RegisterRs
1b: EX/MEM.RegisterRd = ID/EX.RegisterRt
MEM hazard
2a: MEM/WB.RegisterRd = ID/EX.RegisterRs
2b: MEM/WB.RegisterRd = ID/EX.RegisterRt



Data Hazards
• Solutions for Data Hazards
– Stalling
– Forwarding:
• connect new value directly to next stage
– Reordering



Data Hazard - Stalling
add $s0,$t0,$t1     IF   ID   EX   MEM  WB             $s0 written in WB
STALL                    BUBBLE
STALL                         BUBBLE
sub $t2,$s0,$t3                    IF   ID   EX   MEM  WB   $s0 read in ID



Data Hazards - Stalling

Simple Solution to RAW

• Hardware detects RAW and stalls


• Assumes register written then read each cycle
+ low cost to implement, simple
-- reduces IPC
• Try to minimize stalls

Minimizing RAW stalls

• Bypass/forward/shortcircuit (We will use the word “forward”)


• Use data before it is in the register
+ reduces/avoids stalls
-- complex
• Crucial for common RAW hazards




Data Hazards - Forwarding
Key idea: connect new value directly to next stage
• Still read s0, but ignore in favor of new result

• Problem: what about load instructions?



Data Hazards - Forwarding
• STALL still required for load - data avail. after MEM
• MIPS architecture calls this delayed load, initial
implementations required compiler to deal with this
lw $s0,20($t1)      IF   ID   EX   MEM  WB             new value of s0 available after MEM
STALL                    BUBBLE
sub $t2,$s0,$t3               IF   ID   EX   MEM  WB



Forwarding
Key idea: connect data internally before it's stored
Time (in clock cycles); program execution order (in instructions)
Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB
                      CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7   CC 8   CC 9
Value of register $2: 10     10     10     10     10/-20 -20    -20    -20    -20
sub $2, $1, $3        IM     Reg    ALU    DM     Reg
and $12, $2, $5              IM     Reg    ALU    DM     Reg
or $13, $6, $2                      IM     Reg    ALU    DM     Reg
add $14, $2, $2                            IM     Reg    ALU    DM     Reg
sw $15, 100($2)                                   IM     Reg    ALU    DM     Reg

How would you design the forwarding?



Data Hazard Solution: Forwarding
• Key idea: connect data internally before
it's stored

Time (in clock cycles); program execution order (in instructions)
                      CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7   CC 8   CC 9
Value of register $2: 10     10     10     10     10/-20 -20    -20    -20    -20
Value of EX/MEM:      X      X      X      -20    X      X      X      X      X
Value of MEM/WB:      X      X      X      X      -20    X      X      X      X
sub $2, $1, $3        IM     Reg    ALU    DM     Reg
and $12, $2, $5              IM     Reg    ALU    DM     Reg
or $13, $6, $2                      IM     Reg    ALU    DM     Reg
add $14, $2, $2                            IM     Reg    ALU    DM     Reg
sw $15, 100($2)                                   IM     Reg    ALU    DM     Reg

Assumption:
• The register file forwards values that are read and
written during the same cycle.
Data Hazard Summary
• Three types of data hazards
– RAW (MIPS)
– WAW (not in MIPS)
– WAR (not in MIPS)
• Solution to RAW in MIPS
– Stall
– Forwarding
• Detection & Control
– EX hazard
– MEM hazard
• A stall is needed if an instruction reads a register immediately
after a load instruction that writes the same register.
– Reordering
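
The load-use case can be sketched as a simple check. The signal names (`id_ex_mem_read`, `id_ex_rt`, `if_id_rs`, `if_id_rt`) follow the common textbook convention for the MIPS pipeline and are assumptions here; the slides do not spell out this condition. The register numbers are the standard MIPS ones ($s0 = 16, $t1 = 9, $t3 = 11).

```python
def needs_load_use_stall(id_ex_mem_read: bool, id_ex_rt: int,
                         if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in ID reads the register that a load in EX will write."""
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)


# lw $s0, 20($t1) in EX (rt = 16) followed by sub $t2, $s0, $t3 in ID (rs = 16, rt = 11).
stall_after_lw = needs_load_use_stall(True, 16, 16, 11)

# add $s0, $t0, $t1 is not a load, so forwarding alone covers the dependence.
stall_after_add = needs_load_use_stall(False, 16, 16, 11)

print(stall_after_lw, stall_after_add)
```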
Pipelining Outline
• Introduction
– Defining Pipelining
– Pipelining Instructions
• Hazards
– Structural hazards
– Data Hazards \
– Control Hazards
• Performance
• Controller implementation
Data Hazard Review
• Three types of data hazards
– RAW (in MIPS and all others)
– WAW (not in MIPS but many others)
– WAR (not in MIPS but many others)
• Forwarding



Data Hazard Detection in MIPS
Read after Write
Time (in clock cycles); program execution order (in instructions)
Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB
                      CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7   CC 8   CC 9
Value of register $2: 10     10     10     10     10/-20 -20    -20    -20    -20
sub $2, $1, $3        IM     Reg    ALU    DM     Reg
and $12, $2, $5              IM     Reg    ALU    DM     Reg
or $13, $6, $2                      IM     Reg    ALU    DM     Reg
add $14, $2, $2                            IM     Reg    ALU    DM     Reg
sw $15, 100($2)                                   IM     Reg    ALU    DM     Reg

EX hazard
1a: EX/MEM.RegisterRd = ID/EX.RegisterRs
1b: EX/MEM.RegisterRd = ID/EX.RegisterRt
MEM hazard
2a: MEM/WB.RegisterRd = ID/EX.RegisterRs
2b: MEM/WB.RegisterRd = ID/EX.RegisterRt

Problem?
Some instructions do not write a register.
EX/MEM.RegWrite must be asserted!
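
The four detection conditions, together with the RegWrite requirement, can be sketched as a small forwarding-selection function. The class `PipeReg`, the function `forward_sources`, and the returned labels are illustrative names. The extra check that the destination is not register 0 is a common textbook refinement and is an assumption here, not something stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    """The two pipeline-register fields the hazard conditions inspect."""
    reg_write: bool  # RegWrite control signal
    rd: int          # destination register number (RegisterRd)


def forward_sources(id_ex_rs: int, id_ex_rt: int, ex_mem: PipeReg, mem_wb: PipeReg):
    """Return where each ALU operand should come from for the instruction in EX."""
    def pick(src: int) -> str:
        # Conditions 1a/1b: EX hazard, the newest value is in EX/MEM.
        if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
            return "EX/MEM"
        # Conditions 2a/2b: MEM hazard, the value is in MEM/WB.
        if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
            return "MEM/WB"
        return "REGFILE"
    return pick(id_ex_rs), pick(id_ex_rt)


idle = PipeReg(reg_write=False, rd=0)
sub_result = PipeReg(reg_write=True, rd=2)    # sub $2, $1, $3
and_result = PipeReg(reg_write=True, rd=12)   # and $12, $2, $5

# and $12, $2, $5 in EX while sub is in EX/MEM: operand rs = $2 is forwarded.
print(forward_sources(2, 5, sub_result, idle))
# or $13, $6, $2 in EX while and is in EX/MEM and sub is in MEM/WB.
print(forward_sources(6, 2, and_result, sub_result))
```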


Control Hazards
A control hazard is when we need to find
the destination of a branch, and can’t
fetch any new instructions until we
know that destination.

A branch is either
– Taken: PC <= PC + 4 + Immediate
– Not Taken: PC <= PC + 4
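
A minimal sketch of the next-PC choice follows. It assumes the branch immediate is a word offset that is multiplied by 4 before being added, as in standard MIPS; this is consistent with the beq at address 40 with immediate 7 that branches to address 72 later in these slides. The function name is illustrative.

```python
def next_pc(pc: int, taken: bool, word_offset: int) -> int:
    """Address of the next instruction to fetch after a branch at `pc`."""
    fall_through = pc + 4
    if taken:
        # Assumption: the 16-bit immediate counts instructions, so scale by 4 bytes.
        return fall_through + word_offset * 4
    return fall_through


print(next_pc(40, taken=True, word_offset=7))
print(next_pc(40, taken=False, word_offset=7))
```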



Control Hazard on Branches
Three Stage Stall

10: beq r1,r3,36    Ifetch Reg    ALU    DMem   Reg
14: and r2,r3,r5           Ifetch Reg    ALU    DMem   Reg
18: or r6,r1,r7                   Ifetch Reg    ALU    DMem   Reg
22: add r8,r1,r9                         Ifetch Reg    ALU    DMem   Reg
36: xor r10,r1,r11                              Ifetch Reg    ALU    DMem   Reg

The penalty when the branch is taken is 3 cycles!


Branch Hazards
• Just stalling for each branch is not
practical
• Common assumption: branch not taken
• When assumption fails: flush three
instructions
Time (in clock cycles); program execution order (in instructions)
                      CC 1   CC 2   CC 3   CC 4   CC 5   CC 6   CC 7   CC 8   CC 9
40 beq $1, $3, 7      IM     Reg    ALU    DM     Reg
44 and $12, $2, $5           IM     Reg    ALU    DM     Reg
48 or $13, $6, $2                   IM     Reg    ALU    DM     Reg
52 add $14, $2, $2                         IM     Reg    ALU    DM     Reg
72 lw $4, 50($7)                                  IM     Reg    ALU    DM     Reg
Basic Pipelined Processor

In our original Design, branches have a penalty of 3 cycles


Control Hazard Solutions
• Stall
– stop loading instructions until result is available
• Predict
– assume an outcome and continue fetching (undo
if prediction is wrong)
– lose cycles only on mis-prediction
• Delayed branch
– specify in architecture that the instruction
immediately following branch is always executed
Static Branch Prediction
For every branch encountered during execution predict
whether the branch will be taken or not taken.

Predicting branch not taken:


1. Speculatively fetch and execute in-line instructions following the branch
2. If prediction incorrect flush pipeline of speculated instructions
• Convert these instructions to NOPs by clearing pipeline registers
• These have not updated memory or registers at time of flush

Predicting branch taken:


1. Speculatively fetch and execute instructions at the branch target address
2. Useful only if target address known earlier than branch outcome
• May require stall cycles till target address known
• Flush pipeline if prediction is incorrect
• Must ensure that flushed instructions do not update memory/registers



Control Hazard - Stall
add $r4,$r5,$r6     IF   ID   EX   MEM  WB
beq $r0,$r1,tgt          IF   ID   EX   MEM  WB
STALL                         BUBBLE
sw $s4,200($t5)                    IF   ID   EX   MEM  WB

beq writes the PC; the new PC is used by the fetch of sw
Control Hazard - Correct Prediction
add $r4,$r5,$r6          IF   ID   EX   MEM  WB
beq $r0,$r1,tgt               IF   ID   EX   MEM  WB
tgt: sw $s4,200($t5)               IF   ID   EX   MEM  WB

Fetch assuming branch taken



Control Hazard - Incorrect Prediction
add $r4,$r5,$r6          IF   ID   EX   MEM  WB
beq $r0,$r1,tgt               IF   ID   EX   MEM  WB
tgt: sw $s4,200($t5)               IF   BUBBLE BUBBLE BUBBLE BUBBLE   (incorrect - STALL; “squashed” instruction)
or $r8,$r8,$r9                          IF   ID   EX   MEM  WB
INSTRUCTION
LEVEL
PARALLELISM
LECTURE 3

1. SCOREBOARD AND TOMASULO


ALGORITHMS
ILP CHALLENGES
• Pipeline CPI = Ideal CPI + Structural stalls + Data hazard
stalls + Control hazard stalls
• Instruction-level parallelism can be increased inside a
basic block
• A basic block is a straight-line code sequence with no
branches in except at the entry and no branches out except at the exit.
• Dynamic branch frequency is usually 15%-25%
• So a branch occurs every four to seven instructions
• The ILP available within a single basic block is therefore small
• We need to overlap execution across multiple basic blocks

ILP CHALLENGES (CONTROL FLOW
GRAPH)

DEFINITION OF ILP
ILP=Potential overlap of execution among instructions.
Overlapping possible if:
• No Structural Hazards
• No RAW, WAR or WAW Stalls
• No Control Stalls

HARDWARE SCHEMES
TO EXPLOIT ILP
Why?
• Works when can’t know real dependence at compile time
• Compiler Simpler
• Code for one machine runs well on another

KEY IDEA:
• Allow instructions behind stall to
proceed
• Enables out-of-order execution and
completion (commit).
• First implemented in CDC 6600
(1963).

EXAMPLE:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

• ADDD must stall on F0 (it waits until DIVD completes).
• DIVD is a floating-point operation and requires several cycles to complete.
• SUBD depends on neither instruction, but without dynamic scheduling it would stall behind ADDD.
DEPENDENCE
• Data dependence: true dependence (RAW)
• Name dependence: WAR or WAW
• Control dependence
• Data dependence: instruction j is data dependent on instruction i if either
• instruction i produces a result that may be used by instruction j, or
• instruction j is data dependent on instruction k, and k is data dependent on instruction i (transitive dependence)
DEPENDENCE

Data Dependence

NAME DEPENDENCE
• Name dependence:
• It occurs when two or more instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.
• Two types of name dependence
• Anti dependence: Write After Read (WAR)
• Output dependence: Write After Write (WAW)
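
The three dependence types can be told apart mechanically by comparing the destination and source registers of two instructions. The sketch below is only an illustration: the `Instr` tuple and the `classify` helper are hypothetical names, and the instructions are the DIVD/ADDD/SUBD sequence used in the WAR example of this lecture.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def classify(first: Instr, second: Instr) -> Set[str]:
    """Dependences of `second` on `first`, where `first` is earlier in program order."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # true data dependence: second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")  # antidependence: second overwrites a source of first
    if second.dest == first.dest:
        found.add("WAW")  # output dependence: both write the same name
    return found


divd = Instr("DIVD", "F0", ("F2", "F4"))
addd = Instr("ADDD", "F10", ("F0", "F8"))
subd = Instr("SUBD", "F8", ("F8", "F14"))

print("DIVD -> ADDD:", sorted(classify(divd, addd)))
print("ADDD -> SUBD:", sorted(classify(addd, subd)))
print("DIVD -> SUBD:", sorted(classify(divd, subd)))
```

Only register names are compared here; memory locations, which can also carry name dependences, are left out of the sketch.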

SCOREBOARD SCHEME
• The ID stage is split into two parts:
• Issue (decode and check for structural hazards).
• Read Operands (wait until there are no data hazards).
• The scoreboard allows instructions without dependences to execute.
SCOREBOARD IMPLICATIONS
• Out-of-order completion -> WAR and WAW hazards.
• Solutions for WAR:
• Queue both the operation and copies of its operands.
• Read registers only during the Read Operands stage.
SCOREBOARD IMPLICATIONS
• For WAW, the machine stalls until the
other instruction completes
• Multiple execution units
• Scoreboard keeps track of dependencies
and state of operations.

FOUR STAGES OF
SCOREBOARD CONTROL
1. Issue
• Decode instructions & check for structural hazards.
• If a functional unit for the instruction is free and no other
active instruction has the same destination register
(WAW), the scoreboard issues the instruction to the
functional unit and updates its internal data structure.
• If a structural or a WAW hazard exists, then the instruction
issue stalls, and no further instructions will issue until these
hazards are cleared.

FOUR STAGES OF
SCOREBOARD CONTROL
2. Read Operands
• Wait until no data hazards, then read operands
• A source operand is available if:
- no earlier issued active instruction is going to write it, or
- the register containing it is being written by a currently active functional unit
• When the source operands are available, the scoreboard
tells the functional unit to proceed to read the operands
from the registers and begin execution.
• RAW hazards are resolved dynamically in this step, and
instructions may be sent into execution out of order.

FOUR STAGES OF
SCOREBOARD CONTROL
3. Execution
• Operate on operands
• The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the
scoreboard that it has completed execution.

• FUs are characterized by:
- Latency (the effective time used to complete one operation).
- Initiation interval (the number of cycles that must elapse between issuing two operations to the same functional unit).
FOUR STAGES OF
SCOREBOARD CONTROL
4. Write result
• Finish execution
• Once the scoreboard is aware that the
unit has completed execution, the
scoreboard checks for WAR hazards.
• If none, it writes results.
• If WAR, then it stalls the instruction.

WAR EXAMPLE

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
In this case, the scoreboard stalls SUBD in the Write Result stage until ADDD has read F0 and F8.
SCOREBOARD STRUCTURE
1. Instruction status
2. Functional Unit status
Indicates the state of the functional unit (FU):
Busy – Indicates whether the unit is busy or not
Op - The operation to perform in the unit (+,-, etc.)
Fi - Destination register
Fj, Fk – Source register numbers
Qj, Qk – Functional units producing source registers
Rj, Rk – Flags indicating when Fj, Fk are ready
3. Register result status.
Indicates which functional unit will write each register.
Blank if no pending instructions will write that register.
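
To make the three tables concrete, here is a minimal Python sketch of the functional unit status and register result status, together with the issue, read-operands and write-result checks described in this lecture. The class and method names are hypothetical, instruction status and timing are not modelled, and the code is a teaching sketch under those assumptions rather than a description of the CDC 6600 hardware.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FunctionalUnit:
    name: str
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # first source register
    fk: Optional[str] = None  # second source register
    qj: Optional[str] = None  # unit that will produce Fj (None: no pending producer)
    qk: Optional[str] = None  # unit that will produce Fk
    rj: bool = False          # Fj is ready and has not been read yet
    rk: bool = False          # Fk is ready and has not been read yet


class Scoreboard:
    def __init__(self, unit_names):
        self.units: Dict[str, FunctionalUnit] = {n: FunctionalUnit(n) for n in unit_names}
        self.result: Dict[str, str] = {}  # register result status: register -> unit

    def can_issue(self, unit: str, dest: str) -> bool:
        # Structural hazard: the unit is busy. WAW hazard: dest has a pending writer.
        return not self.units[unit].busy and dest not in self.result

    def issue(self, unit, op, dest, src_j, src_k) -> bool:
        if not self.can_issue(unit, dest):
            return False
        fu = self.units[unit]
        fu.busy, fu.op, fu.fi, fu.fj, fu.fk = True, op, dest, src_j, src_k
        fu.qj = self.result.get(src_j)
        fu.qk = self.result.get(src_k)
        fu.rj = fu.qj is None
        fu.rk = fu.qk is None
        self.result[dest] = unit
        return True

    def can_read_operands(self, unit: str) -> bool:
        # RAW hazards are resolved here: both sources must be ready.
        fu = self.units[unit]
        return fu.busy and fu.rj and fu.rk

    def can_write_result(self, unit: str) -> bool:
        # WAR check: no other active unit still has to read our destination register.
        fu = self.units[unit]
        for other in self.units.values():
            if other is fu or not other.busy:
                continue
            if (other.fj == fu.fi and other.rj) or (other.fk == fu.fi and other.rk):
                return False
        return True


sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
sb.issue("Integer", "Load", "F2", None, "R3")
sb.issue("Mult1", "Mult", "F0", "F2", "F4")
print(sb.units["Mult1"])
print("Mult1 can read operands:", sb.can_read_operands("Mult1"))
```

After these two issues, the Mult1 entry records Integer as the producer of F2 and marks that operand as not ready, so the multiply waits in the Read Operands stage.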

SCOREBOARD
EXAMPLE
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
FU

SCOREBOARD
EXAMPLE CYCLE 1
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Integer

SCOREBOARD
EXAMPLE CYCLE 2
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Integer

Integer Pipeline Full – Cannot exec 2nd Load – Issue stalls

SCOREBOARD
EXAMPLE CYCLE 3
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer

• Issue stalls
SCOREBOARD
EXAMPLE CYCLE 4
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Integer

• Issue stalls
SCOREBOARD
EXAMPLE CYCLE 5
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer

• In this cycle the 2nd load is issued.
SCOREBOARD
EXAMPLE CYCLE 6
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6
MULTD F0      F2   F4   6
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 Integer

• MULTD is issued but has to wait for F2
SCOREBOARD
EXAMPLE CYCLE 7
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7
MULTD F0      F2   F4   6
SUBD  F8      F6   F2   7
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 Integer Add

• Now SUBD can be issued, but it has to wait for operand F2 before it can read its operands.
SCOREBOARD
EXAMPLE CYCLE 8A

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7
MULTD F0      F2   F4   6
SUBD  F8      F6   F2   7
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Integer Add Divide

• DIVD is issued, but there is another RAW hazard (on F0)
SCOREBOARD
EXAMPLE CYCLE 8B

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6
SUBD  F8      F6   F2   7
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Add Divide

• Load completes, and operands for MULTD and SUBD are ready
SCOREBOARD
EXAMPLE CYCLE 9

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
10 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 Add Divide

• MULTD and SUBD begin execution in parallel
SCOREBOARD
EXAMPLE CYCLE 11
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
8 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 Add Divide

• SUBD finishes execution
SCOREBOARD
EXAMPLE CYCLE 12

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
7 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 Divide

• Read operands for DIVD?
SCOREBOARD
EXAMPLE CYCLE 13

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
6 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 Add Divide

• SUBD has written its result, so ADDD can be issued
SCOREBOARD
EXAMPLE CYCLE 14
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
5 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 Add Divide

SCOREBOARD
EXAMPLE CYCLE 15
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
4 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
1 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 Add Divide

SCOREBOARD
EXAMPLE CYCLE 16
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14             16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
3 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU Mult1 Add Divide

SCOREBOARD
EXAMPLE CYCLE 17

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14             16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
2 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
17 FU Mult1 Add Divide

• Write result of ADDD? No, there is a WAR hazard: DIVD has not yet read F6
SCOREBOARD
EXAMPLE CYCLE 18
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14             16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
1 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
18 FU Mult1 Add Divide

• Stall, nothing to do
SCOREBOARD
EXAMPLE CYCLE 19
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14             16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
0 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
19 FU Mult1 Add Divide

• MULTD completes execution
SCOREBOARD
EXAMPLE CYCLE 20
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8
ADDD  F6      F8   F2   13     14             16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Add Divide

• MULTD writes back its result
SCOREBOARD
EXAMPLE CYCLE 21
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21
ADDD  F6      F8   F2   13     14             16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide

• DIVD reads its operands now that F0 is available
SCOREBOARD
EXAMPLE CYCLE 22
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21
ADDD  F6      F8   F2   13     14             16                  22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
40 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU Divide

• DIVD has read its operands, so ADDD can now write its result
SCOREBOARD
EXAMPLE CYCLE 61
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21             61
ADDD  F6      F8   F2   13     14             16                  22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide

• DIVD finishes execution
SCOREBOARD
EXAMPLE CYCLE 62
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21             61                  62
ADDD  F6      F8   F2   13     14             16                  22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU

CDC 6600 SCOREBOARD
Achieves a speedup of 2.5 w.r.t. no dynamic
scheduling
By reorganizing instructions the compiler
achieves only 1.7
But
• No cache
• No forwarding hardware
• Limited to instructions in a basic block
• Small number of functional units (structural hazards)
• Wait for WAR hazards
• Prevent WAW hazards

BRANCH PREDICTION
The current DLX wastes one cycle on a branch, but other architectures resolve branches several cycles after the IF stage.
We need to predict the branch outcome as early as possible (in the ID stage).
The performance of branch prediction depends on:
• Accuracy, measured as the percentage of mispredictions
• Cost of a misprediction, measured as the time wasted executing useless instructions.
BRANCH HISTORY TABLE
Table of 1 bit values
Indexed by the lower bits of the PC address
Says whether or not branch taken last time

BRANCH HISTORY TABLE
Problem: in a loop, a 1-bit BHT causes two mispredictions:
1. When we arrive at the end of the loop and must exit. Here the BHT predicts to stay in the loop.
2. When we re-enter the loop, reach its end and must stay in the loop. Here the BHT predicts to exit.
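
The two mispredictions per loop execution can be reproduced with a few lines of Python. This is a sketch under simple assumptions: a single loop-closing branch with its own table entry (no aliasing), a loop of 10 iterations that is executed three times, and an initial prediction of "taken". The function name is hypothetical.

```python
def one_bit_mispredictions(outcomes, initial_prediction=True):
    """Count mispredictions of a 1-bit predictor that repeats the last outcome."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # the single bit records what the branch did last time
    return misses


# Loop branch: taken 9 times to stay in the loop, then not taken to exit.
loop_run = [True] * 9 + [False]

# The whole loop is executed three times in a row.
print(one_bit_mispredictions(loop_run * 3))
```

The first execution costs one misprediction (the exit), and every later execution costs two (the first iteration after re-entry and the exit), giving five in total for three executions.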

DYNAMIC BRANCH PREDICTION
It is a 2-bit scheme in which the prediction changes only after two consecutive mispredictions.
For each index of the table, the 2 bits hold the state of a state machine (next slide).
When we arrive at the end of the loop, we don't change the prediction.
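
A small simulation shows the effect on a loop branch. The sketch below assumes the common 2-bit saturating-counter variant of the scheme (values 0 to 3, predict taken when the counter is 2 or 3), starting in the "strongly taken" state; the function name and the 10-iteration loop are illustrative choices.

```python
def two_bit_mispredictions(outcomes, counter=3):
    """Count mispredictions of a 2-bit saturating counter (0..3, taken if >= 2)."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step towards the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Loop branch: taken 9 times to stay in the loop, then not taken to exit.
loop_run = [True] * 9 + [False]

# The whole loop is executed three times in a row.
print(two_bit_mispredictions(loop_run * 3))
```

Only the loop exit is mispredicted, once per execution of the loop: the single not-taken outcome weakens the counter but does not flip the prediction, so re-entering the loop is predicted correctly.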

WE CAN DESCRIBE THE
ALGORITHM WITH A FSM

BRANCH HISTORY
TABLE ACCURACY
We have a misprediction when:
• We make a wrong guess for that branch
• We get the history of the wrong branch, because two different branches can map to the same table index
BRANCH HISTORY
TABLE ACCURACY
Measurements with a 4096-entry table show misprediction rates from 1% to 18%, depending on the program:
• Nasa7, tomcatv 1%
• Eqntott 18%
• Spice 9%
• Gcc 12%
A 4096-entry table is about as good as an infinite table (for the Alpha 21164)
CORRELATING BRANCHES
Basic hypothesis: recent branches are correlated, i.e., the behavior of recently executed branches affects the prediction of the current branch.
CORRELATING
BRANCHES EXAMPLE
Source code:
    if (a == 2) bb1;
L1: if (b == 2) bb2;
L2: if (a != b) bb3;

Assembly (a in r1, b in r2):
    subi r3,r1,2
    bnez r3,L1
    add  r1,r0,r0   ; bb1
L1: subi r3,r2,2
    bnez r3,L2
    add  r2,r0,r0   ; bb2
L2: sub  r3,r1,r2
    beqz r3,L3
    ...             ; bb3
L3:

The branch at L2 is correlated with the two previous branches: if both of them are not taken, then the branch at L2 is taken.
IDEA:
Record whether each of the m most recently executed branches was taken or not taken, and use that pattern to select the proper branch history table.
EXAMPLE OF A SIMPLE
CORRELATING
PREDICTOR
The branch is predicted on the basis of the
previously executed one by selecting the
appropriate 1 bit BHT.
[Figure: two 1-bit branch prediction tables, one used if the last branch was taken and one used if the last branch was not taken. Labels: branch to be predicted, last branch result, effective branch result.]
(M,N) PREDICTORS
In general, an (m,n) predictor records the outcomes of the last m branches and uses them to select among 2^m branch history tables, each with n-bit entries.
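
The storage cost of an (m,n) predictor follows directly from this definition: 2^m tables, each with one n-bit entry per table index. The helper below is a hypothetical name used only for this sketch; the two configurations are the ones compared in the accuracy chart of this lecture.

```python
def predictor_bits(m, n, entries):
    """Total storage in bits: 2^m history tables of `entries` n-bit counters each."""
    return (2 ** m) * n * entries


# A plain 2-bit BHT is a (0,2) predictor: no global history, a single table.
print(predictor_bits(0, 2, 4096))

# A (2,2) predictor with 1024 entries per table.
print(predictor_bits(2, 2, 1024))
```

Both configurations need 8192 bits, so the comparison between the 4096-entry 2-bit BHT and the 1024-entry (2,2) predictor is a comparison at equal hardware budget.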

EXAMPLE OF A (2,2)
CORRELATING BRANCH
PREDICTOR

Each cell of the predictor represents the state of a 2-bit branch predictor.
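
The following Python sketch shows one possible way to organise such a predictor in software. It is an illustration under stated assumptions, not the hardware design: the class and function names are hypothetical, the counters are saturating and start at 0, the table is indexed by the PC modulo the number of entries, and the test trace is a single branch that alternates between taken and not taken.

```python
class CorrelatingPredictor:
    """(m,n) predictor: m bits of global history select one of 2^m tables of n-bit counters."""

    def __init__(self, m, n, entries):
        self.m = m
        self.entries = entries
        self.max_count = (1 << n) - 1
        self.history = 0  # outcomes of the last m branches, newest in the low bit
        self.tables = [[0] * entries for _ in range(1 << m)]

    def predict(self, pc):
        # Predict taken when the counter is in the upper half of its range.
        return self.tables[self.history][pc % self.entries] > self.max_count // 2

    def update(self, pc, taken):
        table = self.tables[self.history]
        i = pc % self.entries
        table[i] = min(table[i] + 1, self.max_count) if taken else max(table[i] - 1, 0)
        if self.m:
            self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def count_misses(predictor, trace):
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# One branch at a made-up address, alternating taken / not taken, 40 executions.
trace = [(0x40, i % 2 == 0) for i in range(40)]

print("(0,2) misses:", count_misses(CorrelatingPredictor(0, 2, 1024), trace))
print("(2,2) misses:", count_misses(CorrelatingPredictor(2, 2, 1024), trace))
```

On this trace the plain 2-bit predictor keeps mispredicting every taken instance, while the (2,2) predictor mispredicts only while its counters warm up, because the history bits distinguish the two situations.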

ACCURACY OF DIFFERENT
SCHEMES
[Figure: bar chart of the frequency of mispredictions (0% to 18%) on the benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott and li, for three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2).]
ADDRESS MUST ALSO
BE PREDICTED
Access the Branch Target Buffer in the IF stage.
Typical entry: exact address of a branch | predicted PC (only if not sequential)
BRANCH TARGET
BUFFER STRUCTURE
[Figure: the PC of the fetched instruction is sent to an associative lookup, which returns the predicted PC on a match (=).]
• No: the instruction is not predicted to be a branch; proceed normally.
• Yes: the instruction is a branch; the predicted PC should be used as the next PC.
BRANCH TARGET BUFFER
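
The lookup can be sketched in a few lines of Python. This is only an illustration: the class and method names are hypothetical, an unbounded dictionary stands in for the fixed-size associative memory, instructions are assumed to be 4 bytes long, and an entry is kept only while the branch is predicted taken.

```python
class BranchTargetBuffer:
    """Maps the exact address of a branch to its predicted target PC."""

    def __init__(self):
        self.entries = {}  # branch PC -> predicted PC

    def next_pc(self, pc, instr_size=4):
        # Hit: the fetched instruction is predicted to be a taken branch.
        # Miss: proceed normally with the sequential PC.
        return self.entries.get(pc, pc + instr_size)

    def update(self, pc, taken, target):
        # Called once the branch has been resolved.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))  # miss: sequential fetch
btb.update(0x100, taken=True, target=0x200)
print(hex(btb.next_pc(0x100)))  # hit: fetch from the predicted target
```

Because the lookup uses only the PC, the prediction is available in the IF stage, before the instruction has even been decoded.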

SCOREBOARD SCHEME
IMPLEMENTS THE ILP

LECTURE 3 CONTD..

PROF. DR. M. A. ROUF


DEPT. OF CSE, DUET, GAZIPUR
KEY IDEA OF ILP:
• Allow instructions behind stall to proceed
• Enables out-of-order execution and completion
(commit).
• First implemented in CDC 6600 (1963): The CDC
6600 was the flagship of the 6000 series of mainframe computer
systems manufactured by Control Data Corporation

CDC 6600 (1963)

EXAMPLE:
1. DIVD F0,F2,F4
2. ADDD F10,F0,F8
3. SUBD F12,F8,F14

• ADDD must stall on F0 (it waits until DIVD completes).
• DIVD is a floating-point operation and requires several cycles to complete.
• SUBD depends on neither instruction, but without dynamic scheduling it would stall behind ADDD.
DEPENDENCE
1. Data Dependence: True dependence
2. Name dependence: WAR or WAW
3. Control Dependence
• Data dependence: instruction j is data dependent on instruction i if either
• instruction i produces a result that may be used by instruction j, or
• instruction j is data dependent on instruction k, and k is data dependent on instruction i (transitive dependence)
DEPENDENCE

Data Dependence

Name Dependence

NAME DEPENDENCE
• Name dependence:
• Name dependence occurs when two or more instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.
• Two types of name dependence
• Anti dependence: Write After Read (WAR)
• Output dependence: Write After Write (WAW)
SCOREBOARD SCHEME 4 STAGES

1. The ID stage is split into two parts:
a) Issue stage (decode and check for structural hazards).
b) Read Operands stage (wait until there are no data hazards).
2. EX stage: the scoreboard allows instructions without dependences to execute out of order.
3. Write Result stage
SCOREBOARD IMPLICATIONS
• Out-of-order completion -> WAR and WAW hazards. WAR hazards are resolved by:
1. Queuing both the operation and copies of its operands.
2. Reading registers only during the Read Operands stage.
SCOREBOARD IMPLICATIONS
1. For WAW, the machine stalls instruction issue until the previous instruction completes
2. Multiple execution units
3. The scoreboard keeps track of dependences and the state of operations.
FOUR STAGES OF
SCOREBOARD CONTROL

1. Issue
• Decode instructions and check for structural hazards.
• If a functional unit for the instruction is free and no
other active instruction has the same destination
register (WAW), the scoreboard issues the instruction
to the functional unit and updates its internal data
structure.
• If a structural or a WAW hazard exists, then the
instruction issue stalls, and no further instructions will
issue until these hazards are cleared.

FOUR STAGES OF
SCOREBOARD CONTROL
2. Read Operands
a) Wait until no data hazards, then read operands
b) A source operand is available if:
- no earlier issued active instruction is going to write it, or
- the register containing it is being written by a currently active functional unit
c) When the source operands are available, the
scoreboard tells the functional unit to proceed to
read the operands from the registers and begin
execution.
d) RAW hazards are resolved dynamically in this step,
and instructions may be sent into execution out of
order.

FOUR STAGES OF
SCOREBOARD CONTROL
3. Execution Stage
• Operate on operands
• The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the
scoreboard that it has completed execution.
• FUs are characterized by:
- Latency (the effective time used to complete one
operation).
- Initiation interval (the number of cycles that must
elapse between issuing two operations to the same
functional unit).

FOUR STAGES OF
SCOREBOARD CONTROL
4. Write result
• Finish execution
• Once the scoreboard is aware that the
unit has completed execution, the
scoreboard checks for WAR hazards.
• If none, it writes results.
• If WAR, then write result is stalled.

WAR EXAMPLE
Original code:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14

With register renaming:
DIVD F0,F2,F4
ADDD F10,F0,F9
SUBD F8,F9,F14

In the original code, the scoreboard stalls SUBD in the Write Result stage until ADDD has read F0 and F8.
SCOREBOARD STRUCTURE
1. Instruction status
2. Functional Unit status
Indicates the state of the functional unit (FU):
a) Busy – Indicates whether the unit is busy or not
b) Op - The operation to perform in the unit (+,-, etc.)
c) Fi - Destination register
d) Fj, Fk – Two Source Register Numbers
e) Qj, Qk – Functional units producing source registers
f) Rj, Rk – Flags indicating when Fj, Fk are ready
3. Register result status.
• Indicates which functional unit will write each register.
Blank if no pending instructions will write that register.

ASSUMPTIONS FOR
EXAMPLE
1. Load/store unit:
a) Address calculation units: 1
b) Latency: 1 cycle
2. ALU/Integer unit: execution latency 2 cycles
3. Floating-point MultD: execution latency 10 cycles
4. Floating-point DivD: execution latency 40 cycles
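
With these latencies, the cycle in which an instruction completes execution is simply the cycle in which it reads its operands plus the latency of its unit. The small sketch below uses hypothetical names; the LD, MULTD and DIVD values come from the assumptions above, while the 2-cycle value for ADDD and SUBD is an assumption of this sketch. The read-operand cycles passed in are example inputs.

```python
# Execution latency in cycles for each operation.
# ADDD/SUBD = 2 is an assumption of this sketch, not stated in the list above.
LATENCY = {"LD": 1, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}


def execution_complete(op, read_operands_cycle):
    """Cycle in which execution completes, given the cycle operands were read."""
    return read_operands_cycle + LATENCY[op]


# Example: a multiply that reads its operands in cycle 9,
# and a divide that reads its operands in cycle 21.
print(execution_complete("MULTD", 9))
print(execution_complete("DIVD", 21))
```

The long divide latency is why a single DIVD can dominate the total cycle count of a short instruction sequence.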

SCOREBOARD EXAMPLE
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide No

Register result status
SCOREBOARD
EXAMPLE CYCLE 1
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Integer

SCOREBOARD
EXAMPLE CYCLE 2
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Integer

Integer Pipeline Full – Cannot exec 2nd Load – Issue stalls

SCOREBOARD
EXAMPLE CYCLE 3
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer

• Issue stalls
SCOREBOARD
EXAMPLE CYCLE 4
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Integer

• Issue stalls due to the single LD/ST unit and single address calculation unit
SCOREBOARD
EXAMPLE CYCLE 5
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer

• In this cycle the 2nd load is issued.
SCOREBOARD
EXAMPLE CYCLE 6
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6
MULTD F0      F2   F4   6
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 Integer
• MULTD is issued but has to wait for F2
SCOREBOARD
EXAMPLE CYCLE 7
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 Integer Add

• SUBD is issued, but it must wait for F2 before it can read its operands.
SCOREBOARD
EXAMPLE CYCLE 8A

Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Integer Add Divide

• DIVD is issued but there is another RAW hazard for F0
SCOREBOARD
EXAMPLE CYCLE 8B

Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Add Divide

• Load completes, and operands for MULTD and SUBD are ready
SCOREBOARD
EXAMPLE CYCLE 9

Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
10 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 Add Divide
• MULTD and SUBD read their operands and are sent for execution in parallel
SCOREBOARD
EXAMPLE CYCLE 11
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
8 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 Add Divide

• The SUBD finishes execution
SCOREBOARD
EXAMPLE CYCLE 12

Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
7 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 Divide

• Read operands for DIVD: it cannot read F0 before MULTD writes F0
SCOREBOARD
EXAMPLE CYCLE 13

Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
6 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 Add Divide

• SUBD has written its result, so the Add unit is free and ADDD can be issued
SCOREBOARD
EXAMPLE CYCLE 14
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
5 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 Add Divide

• ADDD can read operands
SCOREBOARD
EXAMPLE CYCLE 15
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
4 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
1 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 Add Divide

• ADDD executes on its operands
SCOREBOARD
EXAMPLE CYCLE 16
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
3 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
0 Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU Mult1 Add Divide

• ADDD finishes execution
SCOREBOARD
EXAMPLE CYCLE 17

Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
2 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
17 FU Mult1 Add Divide

• Write result of ADDD? NO, there is a WAR hazard for F6
SCOREBOARD
EXAMPLE CYCLE 18
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
1 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
18 FU Mult1 Add Divide

• Stall continued, nothing to do
SCOREBOARD
EXAMPLE CYCLE 19
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
0 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
19 FU Mult1 Add Divide

• MULTD, execution completed
SCOREBOARD
EXAMPLE CYCLE 20
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Add Divide

• MULTD writes back its result
SCOREBOARD
EXAMPLE CYCLE 21
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide

• DIVD reads its operands now that F0 is available
SCOREBOARD
EXAMPLE CYCLE 22
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
40 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU Divide

• Now DIVD can execute on its operands, and ADDD can write the result
SCOREBOARD
EXAMPLE CYCLE 61
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide

• DIVD finishes execution after 40 cycles
SCOREBOARD
EXAMPLE CYCLE 62
Instruction status
Instruction j k Issue Read operands Execution complete Write Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61 62
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU

• DIVD finishes after writing back results
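The WAR stall seen in cycles 17 to 21 (ADDD may not write F6 until DIVD has read it) can be sketched as a small check on the functional unit status table. This is only an illustrative sketch: the class and function names are hypothetical, and it assumes the common textbook convention that Rj/Rk are cleared once a unit has read its operands (the tables above leave them at Yes).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionalUnit:
    name: str
    busy: bool = False
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    rj: bool = False           # Fj is ready and has not been read yet
    rk: bool = False           # Fk is ready and has not been read yet


def can_write_result(unit, all_units):
    """Return True if `unit` may write Fi without causing a WAR hazard."""
    for other in all_units:
        if other is unit or not other.busy:
            continue
        # `other` still has to read the old value of unit.fi
        if (other.fj == unit.fi and other.rj) or (other.fk == unit.fi and other.rk):
            return False
    return True


# Cycle 17: ADDD has finished executing; DIVD waits for F0 and has not read F6.
add = FunctionalUnit("Add", True, "F6", "F8", "F2", rj=False, rk=False)
div = FunctionalUnit("Divide", True, "F10", "F0", "F6", rj=False, rk=True)
print(can_write_result(add, [add, div]))   # False: WAR hazard on F6

# Cycle 21: DIVD reads its operands, so its flags are cleared (assumed convention).
div.rj = div.rk = False
print(can_write_result(add, [add, div]))   # True: ADDD writes in cycle 22
```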
Lecture 3.3
Tomasulo’s Algorithm
Dynamic Scheduling Using Tomasulo’s Algorithm
• It was used in the IBM 360/91 floating-point unit
• Invented by Robert Tomasulo of IBM
• It tracks when operands become available
• It minimizes RAW hazards
• It introduces register renaming to minimize WAW and WAR hazards

CSE-4821 Tomasulo's Algorithm 2


Hardware Speculation (“Boosting”)

• Issue an instruction that depends on a branch before the branch result is known.
• Commit is always made in order.
• Commit of a speculative instruction is made only when the branch outcome is known.
• The same holds for exceptions (synchronous or asynchronous deviations of control flow).


Speculative Tomasulo’s Algorithm

• Tomasulo’s “Boosting” needs a buffer for uncommitted results, which is called the reorder buffer (ROB).
• Each entry is:
Instruction Destination Value
• The ROB has a slot for each issued instruction.
• When an instruction writes into a register, it writes only in its assigned slot in the ROB.
• The reorder buffer can be an operand source
• The reservation station (RS) or load buffers
• Destination like register file (RF) and store buffers


Tomasulo’s ROB (cont.)

• Reservation stations (RS) now only queue instructions to FUs (to reduce structural hazards)
• Pointers are now directed toward ROB slots.
• A common data bus (CDB) forwards results to the units waiting for operands
Four steps of speculative Tomasulo’s Algorithm

1. Issue: get an instruction from the queue. Both the RS and the ROB must have a free slot. Dispatch the operation, indicating the ROB slot into which it must write.
2. Execution: when both operands are ready, the instruction is executed. If not, watch the CDB for the missing operands.
3. Write Result: write on the CDB and into the ROB.
4. Commit (“graduation”): the completed instruction at the head of the ROB updates its destination register and is removed. Mispredicted branches flush the ROB.
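The in-order commit rule in step 4 can be sketched with a small queue. This is a minimal illustration and not the hardware design: the `ReorderBuffer` class, its method names and the example values are all hypothetical, and branches are reduced to an explicit `flush()` call.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()     # oldest instruction at the left (head)
        self.registers = {}        # architectural register file

    def issue(self, instruction, dest):
        """Allocate a slot; return None when the ROB is full (issue stalls)."""
        if len(self.entries) >= self.size:
            return None
        entry = {"instr": instruction, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        """Results go only to the assigned slot, never directly to a register."""
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire completed instructions from the head, strictly in order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["instr"])
        return committed

    def flush(self):
        """Discard all uncommitted results, e.g. after a mispredicted branch."""
        self.entries.clear()


rob = ReorderBuffer(size=4)
mul = rob.issue("MULTD F0,F2,F4", "F0")
add = rob.issue("ADDD F6,F8,F2", "F6")
rob.write_result(add, 3.0)      # the younger instruction finishes first
print(rob.commit())             # [] : the head (MULTD) is not done yet
rob.write_result(mul, 8.0)
print(rob.commit())             # ['MULTD F0,F2,F4', 'ADDD F6,F8,F2']
```

Execution completes out of order here, but the register file is only ever updated in program order.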
Speculative Tomasulo’s algorithm



Tomasulo Algorithm

• Invented at IBM 3 years after the CDC 6600, for the IBM 360/91
• Same goal: performance without special compilers
• Led to:
• Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604


Tomasulo Algorithm Basics

• The control logic and the buffers are distributed with the FUs
• Operand buffers are called reservation stations.
• Each instruction is an entry of a reservation station.
• Its operands are replaced by values or pointers (Register Renaming)


Tomasulo Algorithm Basics
• Register renaming makes it possible to avoid WAR and WAW hazards
• There are more reservation stations than registers (so the hardware can do better optimizations than a compiler).
• Results are dispatched to other FUs through a Common Data Bus
• Loads and stores are treated as FUs
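How renaming removes WAR and WAW hazards can be sketched as follows. This is a simplified model under stated assumptions: the tags `RS1`, `RS2`, ... and the `Reg[Fx]` notation are hypothetical, each instruction gets its own tag, and a source is treated as already available whenever no earlier instruction in the list writes it.

```python
def rename(instructions):
    """Rename source operands in program order.

    instructions: list of (op, dest, src1, src2)
    Returns a list of (tag, op, src1_ref, src2_ref), where each source is
    either the tag of the producing instruction or 'Reg[Fx]', meaning the
    value was copied from the register file at issue time.
    """
    producer = {}    # register -> tag of the latest instruction that writes it
    renamed = []
    for n, (op, dest, s1, s2) in enumerate(instructions, start=1):
        tag = f"RS{n}"
        refs = tuple(producer.get(s, f"Reg[{s}]") for s in (s1, s2))
        renamed.append((tag, op) + refs)
        producer[dest] = tag     # later readers of `dest` wait for this tag
    return renamed


program = [
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),
]
for row in rename(program):
    print(row)
# ('RS1', 'MULTD', 'Reg[F2]', 'Reg[F4]')
# ('RS2', 'SUBD', 'Reg[F6]', 'Reg[F2]')
# ('RS3', 'DIVD', 'RS1', 'Reg[F6]')
# ('RS4', 'ADDD', 'RS2', 'Reg[F2]')
```

DIVD holds its own copy of the old F6 value, so ADDD can write F6 as soon as it finishes; no reference to a register name survives inside the reservation stations.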



Tomasulo Algorithm for an FPU



Reservation Station Components
• Tag identifying the RS
• Op = the operation to perform in the unit
• Vj, Vk = values of the source operands
• Qj, Qk = pointers to the RSs that will produce Vj, Vk
• Busy = indicates that the RS is busy


Other components
• The RF and the store buffers have a Value (V) and a Pointer (Q) field.
• Load buffers have an address field and a busy field.
• Store buffers also have an address field.


The Three Stages of the Tomasulo Algorithm

• Issue
– Get an instruction I from the queue. If it is an FP operation, check whether an RS is free (i.e., check for structural hazards).
– Rename registers.
– WAR resolution: suppose I writes Rx and an already-issued instruction J reads Rx. J already holds the value of Rx or knows which instruction will produce it, so the RF entry for Rx can be linked to I.
– WAW resolution: since issue is in order, the RF entry can be linked to I, the latest writer.


The Three Stages of the Tomasulo Algorithm
• Execution
– When both operands are ready, execute. If not ready, watch the common data bus for results.
• Write result
– Write on the Common Data Bus (CDB) to all waiting units; mark the reservation station available.


The Common Data Bus
• A common data bus is a data + source bus.
• In the IBM 360/91: data = 64 bits, source = 4 bits
• FU must perform associative lookup in the RS.




Tomasulo (IBM) versus Scoreboard (CDC)

    Tomasulo (IBM)                      Scoreboard (CDC)
1.  Pipelined FUs                       Multiple but not pipelined FUs
2.  Issue window size = 14              Issue window size = 5
3.  No issue on structural hazards      No issue on structural hazards
4.  WAR, WAW avoided with renaming      Stall the completion for WAW and WAR hazards
5.  Broadcast results from FU           Results written back on registers
6.  Control distributed on RS           Control centralized through the Scoreboard


Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
a) Stall if there is a structural hazard, i.e. no space in the reservation station (rs).
b) If a reservation station is free, the issue logic issues the instr to the rs and reads operands into the rs if ready (register renaming => solves WAR).
c) Make the status of the destination register wait for this latest instr even if the previous instr writing to this register hasn’t completed => solves WAW hazards.


Three Stages of Tomasulo Algorithm

2. Execution: operate on operands (EX)
When both operands are ready, execute; if not ready, watch the CDB for the result (solves RAW).
3. Write result: finish execution (WB)
Write on the Common Data Bus to all waiting units; mark the reservation station available. Write the result into the destination register only if its status still names this reservation station r (solves WAW).


Three Stages of Tomasulo Algorithm

• Normal data bus: data + destination (“go to” bus)
• CDB: data + source (“come from” bus)
• 64 bits of data + 4 bits of Functional Unit source address
• Write if matches expected Functional Unit (produces result)
• Does broadcast
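The "come from" behaviour of the CDB can be sketched as a tag match in every waiting unit. The sketch below is illustrative only: the class, the function and the numeric operand values are hypothetical, and timing is ignored. The station contents mirror the Tomasulo example that follows, where Mult1 and Add1 both wait for Load2.

```python
class ReservationStation:
    def __init__(self, name, op, vj=None, vk=None, qj=None, qk=None):
        self.name, self.op = name, op
        self.vj, self.vk = vj, vk    # operand values (None while pending)
        self.qj, self.qk = qj, qk    # producer tags (None once the value is held)

    def ready(self):
        return self.qj is None and self.qk is None

    def snoop(self, source_tag, value):
        """Capture the broadcast value if this station is waiting for that source."""
        if self.qj == source_tag:
            self.vj, self.qj = value, None
        if self.qk == source_tag:
            self.vk, self.qk = value, None


def broadcast(stations, register_status, registers, source_tag, value):
    """Put (source_tag, value) on the CDB: every waiting unit picks it up."""
    for rs in stations:
        rs.snoop(source_tag, value)
    for reg, tag in list(register_status.items()):
        if tag == source_tag:        # the register still expects this producer
            registers[reg] = value
            del register_status[reg]


mult1 = ReservationStation("Mult1", "MULTD", vk=4.0, qj="Load2")
add1 = ReservationStation("Add1", "SUBD", vj=6.0, qk="Load2")
register_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
registers = {}
broadcast([mult1, add1], register_status, registers, "Load2", 2.5)
print(mult1.ready(), add1.ready(), registers)   # True True {'F2': 2.5}
```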



Reservation Station Components
Op—Operation to perform in the unit (e.g., + or –)
Vj, Vk— Value of the source operand.
Qj, Qk— Name of the RS that would provide the source operands.
Value zero means the source operands already available in Vj or
Vk, or is not necessary.
Busy—Indicates reservation station or FU is busy

Register File Status Qi:


Qi —Indicates which functional unit will write each register, if one
exists. Blank (0) when no pending instructions that will write that
register meaning that the value is already available.



Tomasulo Example Cycle 0
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 Load1 No
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU
Tomasulo Example Cycle 1
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Load1
Tomasulo Example Cycle 2
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2- Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2 Assume Load takes 2 cycles
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Load2 Load1
Tomasulo Example Cycle 3
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 Load1 Yes 34+R2
LD F2 45+ R3 2 3- Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No read value
Add3 No
0 Mult1 Yes Mult R(F4) Load2
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Mult1 Load2 Load1
Tomasulo Example Cycle 4
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes Sub M(A1) Load2
0 Add2 No
Add3 No
0 Mult1 Yes Mult R(F4) Load2
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Mult1 Load2 M(A1) Add1
Tomasulo Example Cycle 5
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
2 Add1 Yes Sub M(A1) M(A2)
0 Add2 No
Add3 No
10 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Mult1 M(A2) M(A1) Add1 Mult2
Tomasulo Example Cycle 6
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 --
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
1 Add1 Yes Sub M(A1) M(A2)
0 Add2 Yes Add M(A2) Add1
Add3 No
9 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 M(A2) Add2 Add1 Mult2
Tomasulo Example Cycle 7
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes Sub M(A1) M(A2)
0 Add2 Yes Add M(A2) Add1
Add3 No
8 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 M(A2) Add2 Add1 Mult2
Tomasulo Example Cycle 8
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
2 Add2 Yes Add M1-M2 M(A2)
Add3 No
7 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 M(A2) Add2 M1-M2 Mult2
Tomasulo Example Cycle 9
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 --
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
1 Add2 Yes Add M1-M2 M(A2)
Add3 No
6 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 M(A2) Add2 M1-M2 Mult2
Tomasulo Example Cycle 10
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 Yes Add M1-M2 M(A2)
Add3 No
5 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
10 FU Mult1 M(A2) Add2 M1-M2 Mult2
Tomasulo Example Cycle 11
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
4 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 M(A2) M1-M2+M(A2) M1-M2 Mult2
Tomasulo Example Cycle 12
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
3 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 M(A2) M1-M2+M(A2) M1-M2 Mult2
Tomasulo Example Cycle 15
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
0 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 M(A2) M1-M2+M(A2) M1-M2 Mult2
Tomasulo Example Cycle 16
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes Div M*F4 M(A1)
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU M*F4 M(A2) M1-M2+M(A2) M1-M2 Mult2
Tomasulo Example Cycle 56
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5 17 -- 56
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes Div M*F4 M(A1)
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
56 FU M*F4 M(A2) M1-M2+M(A2) M1-M2 Mult2
Tomasulo Example Cycle 57
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5 17 -- 56 57
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
57 FU M*F4 M(A2) M1-M2+M(A2) M1-M2 result
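The completion and write-back cycles in the Tomasulo example follow from simple arithmetic on the latencies. The sketch below assumes the latencies that can be read off the tables (load 2 cycles as stated, add/subtract 2, multiply 10, divide 40) and that an instruction writes its result one cycle after execution completes; the function name is hypothetical.

```python
# Latencies in cycles, read off the example's Time column (load latency is stated).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}


def finish_times(op, exec_start):
    """Return (execution-complete cycle, write-result cycle)."""
    complete = exec_start + LATENCY[op] - 1   # the start cycle counts as cycle 1
    return complete, complete + 1


# Execution start cycles taken from the example tables.
for op, start in [("MULTD", 6), ("SUBD", 6), ("ADDD", 9), ("DIVD", 17)]:
    print(op, finish_times(op, start))
# MULTD (15, 16)
# SUBD (7, 8)
# ADDD (10, 11)
# DIVD (56, 57)
```

DIVD cannot start before cycle 17 because MULTD broadcasts F0 in cycle 16. The whole sequence finishes at cycle 57, compared with cycle 62 in the scoreboard walk-through.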
CSE-4821
Advanced Computer Architecture
L3.4: VLIW Architecture
Dr. M. A. Rouf
Dept. of CSE, DUET, Gazipur
Basic Working Principles of VLIW
• Aim at speeding up computation by exploiting instruction-level parallelism.
• Same hardware core as superscalar processors, having multiple execution units (EUs) working in parallel.
• An instruction consists of multiple operations; typical word length ranges from 52 bits to 1 Kbit.
• All operations in an instruction are executed in lock-step mode.
• One or multiple register files for FX and FP data.
• Rely on the compiler to find parallelism and schedule dependency-free program code.

CSE-4821, L3.4 VLIW Architecture for ILP,


2
Dr. M. A. Rouf, Dept. of CSE, DUET
Comparison of VLIW, CISC, RISC
Generation of VLIW instruction words

A hypothetical VLIW processor architecture


Basic VLIW Approach

Register File Structure for VLIW

• What is the challenge for the register file in VLIW?
R/W ports
Differences Between VLIW &
Superscalar Architecture (I)

Differences Between VLIW &
Superscalar Architecture (II)
• Instruction formulation:
– Superscalar:
• Receives conventional instructions conceived for sequential processors.
– VLIW:
• Receives (very) long instruction words, each comprising a field (or opcode) for each execution unit.
• Instruction word length depends on (a) the number of execution units, and (b) the code length needed to control each unit (such as opcode length, register names, …).
• Typical word length is 64 – 1024 bits, much longer than conventional machine word length.
Differences Between VLIW &
Superscalar Architecture (III)
• Instruction scheduling:
– Superscalar:
• Done dynamically at run-time by the hardware.
• Data dependency is checked and resolved in hardware.
• Need a lookahead hardware window for instruction fetch.

Differences Between VLIW &
Superscalar Architecture (IV)
• Instruction scheduling (cont’d):
– VLIW:
• Static scheduling done at compile-time by the compiler.
• Advantages:
– Reduced hardware complexity.
– Tasks such as decoding, data dependency detection and instruction issue become simple.
– Potentially higher clock rate.
– Higher degree of parallelism with global program information.
Differences Between VLIW &
Superscalar Architecture (V)
• Instruction scheduling (cont’d):
– VLIW:
• Disadvantages
– Higher complexity of the compiler.
– Compiler optimization needs to consider technology-dependent parameters such as latencies and the load-use time of the cache.
(Question: What happens to the software if the hardware is updated?)
– Non-deterministic problem of cache misses, resulting in worst-case assumptions for code scheduling.
– In case of unfilled opcodes in a (V)LIW, memory space and instruction bandwidth are wasted.
Development history of
Proposed/Commercial VLIWs

Case Study of VLIW: Trace 200 Family (I)

Case Study of VLIW: Trace 200 Family (II)

• Only two branches might be used in Trace 7/200


Code Expansion in VLIW
• It is found that code in VLIW is expanded roughly by a factor of three.
• For “long” VLIW, more opcode fields will be left empty. This wastes bandwidth and storage space. Can you propose a solution for it?
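Where the empty fields come from can be seen with a toy packer. This is only a sketch under simplifying assumptions that are not in the slides: every operation has unit latency, any operation fits any slot, and the greedy `pack_vliw` routine and the five-operation dependence chain are hypothetical.

```python
def pack_vliw(ops, deps, slots):
    """Greedily pack operations into VLIW words of `slots` fields each.

    ops:   operation names in program order
    deps:  dict mapping an operation to the set of operations it depends on
    Each operation goes into the earliest word that comes after all of its
    producers and still has a free field; unused fields are filled with NOPs.
    """
    word_of = {}
    words = []
    for op in ops:
        w = max((word_of[d] + 1 for d in deps.get(op, ())), default=0)
        while w < len(words) and len(words[w]) >= slots:
            w += 1                       # word is full, try the next one
        while len(words) <= w:
            words.append([])
        words[w].append(op)
        word_of[op] = w
    return [word + ["NOP"] * (slots - len(word)) for word in words]


ops = ["ld1", "ld2", "add", "mul", "st"]
deps = {"add": {"ld1", "ld2"}, "mul": {"add"}, "st": {"mul"}}
words = pack_vliw(ops, deps, slots=4)
for word in words:
    print(word)

total_fields = sum(len(word) for word in words)
print(total_fields / len(ops))   # 3.2 fields stored per useful operation
```

With a dependence chain like this one, most fields of a 4-wide word stay empty; 16 fields are stored for 5 useful operations.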

END

Compiler techniques
for exposing ILP

CSE-4821 Advanced Computer Architecture


Dr. M. A. Rouf, Dept. of CSE, DUET, Gazipur
Instruction Level Parallelism
• Potential overlap among instructions
• Few possibilities in a basic block
– Blocks are small (6-7 instructions)
– Instructions are dependent
• Goal: Exploit ILP across multiple basic
blocks
– Iterations of a loop
for (i = 1000; i > 0; i=i-1)
x[i] = x[i] + s;

CSE-4821 Compiler Techniques 2


for ILP, by Dr. M. A. Rouf
Basic Scheduling
for (i = 1000; i > 0; i=i-1)
    x[i] = x[i] + s;

Sequential MIPS Assembly Code:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Pipelined execution (cycle number on the right):
Loop: LD    F0, 0(R1)      1
      stall                2
      ADDD  F4, F0, F2     3
      stall                4
      stall                5
      SD    0(R1), F4      6
      SUBI  R1, R1, #8     7
      stall                8
      BNEZ  R1, Loop       9
      stall                10

Scheduled pipelined execution (cycle number on the right):
Loop: LD    F0, 0(R1)      1
      SUBI  R1, R1, #8     2
      ADDD  F4, F0, F2     3
      stall                4
      BNEZ  R1, Loop       5
      SD    8(R1), F4      6
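The cycle counts on this slide (10 cycles unscheduled, 6 cycles scheduled) can be reproduced with a small in-order issue model. This is a minimal sketch, not the slide's own method: the latency table is an assumption (textbook-style values chosen to be consistent with the stalls shown), and `loop_cycles`, `LATENCY`, and `KIND` are hypothetical names.

```python
# Assumed extra cycles between a producer and a dependent consumer.
LATENCY = {
    ("load", "fpalu"): 1,
    ("fpalu", "store"): 2,
    ("fpalu", "fpalu"): 3,
    ("int", "branch"): 1,
}
KIND = {"LD": "load", "ADDD": "fpalu", "SD": "store", "SUBI": "int", "BNEZ": "branch"}

def loop_cycles(body):
    """Cycles for one iteration; body items are (opcode, dest, sources)."""
    cycle = 0
    writer = {}  # register -> (issue cycle, kind) of its latest producer
    for op, dest, srcs in body:
        kind = KIND[op]
        issue = cycle + 1
        for reg in srcs:
            if reg in writer:
                when, producer = writer[reg]
                issue = max(issue, when + LATENCY.get((producer, kind), 0) + 1)
        cycle = issue
        if dest is not None:
            writer[dest] = (issue, kind)
    if KIND[body[-1][0]] == "branch":
        cycle += 1  # unfilled branch delay slot (assumed 1 cycle)
    return cycle

original = [
    ("LD", "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD", None, ["R1", "F4"]),
    ("SUBI", "R1", ["R1"]),
    ("BNEZ", None, ["R1"]),
]
scheduled = [
    ("LD", "F0", ["R1"]),
    ("SUBI", "R1", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("BNEZ", None, ["R1"]),
    ("SD", None, ["R1", "F4"]),  # fills the branch delay slot
]
```

The model only reports the total per iteration. It places the remaining stall of the scheduled version between BNEZ and SD, whereas the slide shows it before BNEZ, because a real delay slot must directly follow the branch; the total is the same.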
Loop Unrolling
Unrolling 4 Times
for (i = 1000; i > 0; i=i-4)
{
x[i] = x[i] + s;
x[i-1] = x[i-1] + s;
x[i-2] = x[i-2] + s;
x[i-3] = x[i-3] + s;
}
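The unrolled loop above relies on the trip count (1000) being a multiple of 4. As a hedged sketch of the general case, the version below adds a cleanup loop for leftover iterations. It uses 0-based Python lists rather than the slide's indexes 1..1000, and the function names are hypothetical.

```python
def add_scalar(x, s):
    """Reference loop: x[i] = x[i] + s for every element, walking downward."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s
    return x

def add_scalar_unrolled4(x, s):
    """Same computation with the body unrolled four times plus a cleanup loop."""
    i = len(x) - 1
    while i >= 3:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4
    while i >= 0:  # leftover iterations when the count is not a multiple of 4
        x[i] = x[i] + s
        i -= 1
    return x
```

Both functions should return the same list for any length; the unrolled one executes roughly a quarter of the loop tests and index updates.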

Loop Unrolling
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Exit:

Pros:
– Larger basic block
– More scope for scheduling and eliminating dependencies
Cons:
– Increases code size
Comment:
– Often a possibility for other optimizations
Loop Transformations
• Instruction independence is the key
requirement for the transformations
• Example
– Determine that it is legal to move SD after SUBI and
BNEZ
– Determine that unrolling is useful (iterations are
independent)
– Use different registers to avoid unnecessary constraints
– Eliminate extra tests and branches
– Determine that LD and SD can be interchanged
– Schedule the code, preserving the semantics of the
code
1. Eliminating Name Dependences
Before renaming:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F0, -8(R1)
      ADDD  F4, F0, F2
      SD    -8(R1), F4
      LD    F0, -16(R1)
      ADDD  F4, F0, F2
      SD    -16(R1), F4
      LD    F0, -24(R1)
      ADDD  F4, F0, F2
      SD    -24(R1), F4
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

Register Renaming: rename F0 register in LD to remove dependency.

After renaming:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, #32
      BNEZ  R1, Loop
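The renaming step can be sketched as a small pass over the instruction list. This is an illustration under assumptions, not a description of any particular compiler: instructions are modeled as (opcode, destination, sources) tuples, address offsets are left out because they do not affect renaming, and `rename` and the free-register pool are hypothetical.

```python
def rename(instrs, free_regs):
    """Give every redefinition of a register a fresh name (removes WAR/WAW dependences)."""
    free = list(free_regs)
    current = {}  # original register name -> name that currently holds its value
    out = []
    for op, dest, srcs in instrs:
        new_srcs = [current.get(r, r) for r in srcs]  # read the latest names first
        new_dest = dest
        if dest is not None:
            if dest in current:       # register is being reused: pick a fresh one
                new_dest = free.pop(0)
            current[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out

# Two copies of the loop body, both written with F0 and F4.
body = [
    ("LD", "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD", None, ["R1", "F4"]),
] * 2
renamed = rename(body, ["F6", "F8"])
```

With the pool `["F6", "F8"]` the second copy becomes LD F6, ADDD F8, F6, F2 and SD of F8, matching the register choice on the slide; true (read-after-write) dependences are preserved because sources are always mapped to the latest name.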
2. Eliminating Control Dependences
Intermediate BEQZ are never taken: eliminate!

Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Exit:
3. Eliminating Data Dependences
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

• Data dependencies SUBI, LD, SD
– Force sequential execution of iterations
• Compiler removes this dependency by:
– Computing intermediate R1 values
– Eliminating intermediate SUBI
– Changing final SUBI
• Data flow analysis
– Can do on Registers
– Cannot do easily on memory locations
100(R1) = 20(R2)
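The three compiler steps listed on this slide (compute intermediate R1 values, eliminate intermediate SUBI, change the final SUBI) can be sketched as one pass. The tuple encoding and the name `fold_subi` are assumptions made for this illustration, and the closing BNEZ is deliberately left out of the input so that the single remaining SUBI can be appended before it.

```python
def fold_subi(instrs, base="R1"):
    """Fold every SUBI base,base,#k into later memory offsets; keep one final SUBI."""
    pending = 0  # amount the original code has already subtracted from base
    out = []
    for ins in instrs:
        if ins[0] == "SUBI" and ins[1] == base:
            pending += ins[2]
        elif ins[0] in ("LD", "SD") and ins[3] == base:
            op, reg, offset, _ = ins
            out.append((op, reg, offset - pending, base))
        else:
            out.append(ins)
    out.append(("SUBI", base, pending))  # single combined decrement
    return out

# Two unrolled copies; memory ops are (op, fp_reg, offset, base_reg).
body = [
    ("LD", "F0", 0, "R1"),
    ("ADDD", "F4", "F0", "F2"),
    ("SD", "F4", 0, "R1"),
    ("SUBI", "R1", 8),
    ("LD", "F6", 0, "R1"),
    ("ADDD", "F8", "F6", "F2"),
    ("SD", "F8", 0, "R1"),
    ("SUBI", "R1", 8),
]
folded = fold_subi(body)
```

The second copy now reads and writes at offset -8 and a single SUBI R1, R1, #16 remains, so the copies no longer depend on each other through R1.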
4. Alleviating Data Dependencies
Unrolled loop:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

Scheduled unrolled loop:
Loop: LD    F0, 0(R1)
      LD    F6, -8(R1)
      LD    F10, -16(R1)
      LD    F14, -24(R1)
      ADDD  F4, F0, F2
      ADDD  F8, F6, F2
      ADDD  F12, F10, F2
      ADDD  F16, F14, F2
      SD    0(R1), F4
      SD    -8(R1), F8
      SUBI  R1, R1, #32
      SD    16(R1), F12
      BNEZ  R1, Loop
      SD    8(R1), F16
Some General Comments
• Dependences are a property of programs
• Actual hazards are a property of the pipeline
• Techniques to avoid dependence limitations
– Maintain dependences but avoid hazards
• Code scheduling
– hardware
– software
– Eliminate dependences by code transformations
• Complex
• Compiler-based

Loop-level Parallelism
• Primary focus of dependence analysis
• Determine all dependences and find cycles
Example 1:
for (i=1; i<=100; i=i+1) {
    x[i] = y[i] + z[i];
    w[i] = x[i] + v[i];
}

Example 2:
for (i=1; i<=100; i=i+1) {
    x[i+1] = x[i] + z[i];
}

Example 3:
for (i=1; i<=100; i=i+1) {
    x[i] = x[i] + y[i];
    y[i+1] = w[i] + z[i];
}

Example 3 rewritten:
x[1] = x[1] + y[1];
for (i=1; i<=99; i=i+1) {
    y[i+1] = w[i] + z[i];
    x[i+1] = x[i +1] + y[i +1];
}
y[101] = w[100] + z[100];
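A quick way to convince yourself that the rewritten loop on this slide computes the same values as the loop with x[i] = x[i] + y[i]; y[i+1] = w[i] + z[i] is to run both on the same data. The sketch below does that; the function names and the test data are hypothetical, and index 0 of each list is unused so that the indexes match the slide (1..n).

```python
def original(x, y, w, z, n):
    """x[i] uses the y[i] written in the previous iteration (loop-carried dependence)."""
    for i in range(1, n + 1):
        x[i] = x[i] + y[i]
        y[i + 1] = w[i] + z[i]
    return x, y

def transformed(x, y, w, z, n):
    """Same results, but y[i+1] is now produced and consumed in the same iteration."""
    x[1] = x[1] + y[1]
    for i in range(1, n):
        y[i + 1] = w[i] + z[i]
        x[i + 1] = x[i + 1] + y[i + 1]
    y[n + 1] = w[n] + z[n]
    return x, y

def make_inputs(n):
    """Arbitrary integer data of length n + 2 (indexes 0..n+1)."""
    x = [3 * i for i in range(n + 2)]
    y = [i + 7 for i in range(n + 2)]
    w = [2 * i + 1 for i in range(n + 2)]
    z = [5 - i for i in range(n + 2)]
    return x, y, w, z
```

After the rewrite no statement depends on a value from an earlier iteration, so the iterations of the new loop can be overlapped or run in parallel.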
Dependence Analysis Algorithms

• Assume array indexes are affine (a*i + b)
– GCD test:
For two affine array indexes a*i + b (the store) and c*i + d (the load):
if a loop-carried dependence exists, then GCD(c, a) must
divide (d - b)
Example: x[8*i] = x[4*i + 2] + 3
Here a = 8, b = 0, c = 4, d = 2, so (d - b)/GCD(c, a) = (2 - 0)/GCD(8, 4) = 2/4,
which is not an integer: no dependence is possible
• In general, determining exactly whether a dependence exists is NP-complete
• a, b, c, and d may not be known at compile time
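The GCD test can be written in a few lines. This is a minimal sketch; the function name `gcd_test` is hypothetical. Note that the test is conservative: a result of True only means a dependence is possible (loop bounds are ignored), while False rules it out.

```python
from math import gcd

def gcd_test(a, b, c, d):
    """True if a dependence between x[a*i + b] (store) and x[c*i + d] (load) is possible."""
    g = gcd(a, c)
    if g == 0:  # both indexes are constants: they conflict only if equal
        return b == d
    return (d - b) % g == 0

# Slide example: x[8*i] = x[4*i + 2] + 3  ->  a=8, b=0, c=4, d=2
slide_example = gcd_test(8, 0, 4, 2)
```

For the slide example GCD(8, 4) = 4 does not divide 2, so the call returns False. For x[i+1] = x[i] (a = 1, b = 1, c = 1, d = 0) it returns True, which agrees with the loop-carried dependence in that loop.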
Software Pipelining
[Figure: Iteration 0, Iteration 1, Iteration 2 and Iteration 3 overlapped in time; a software pipelined iteration takes instructions from several of them, with start-up code before and finish-up code after it]
Example
Iteration i          Iteration i+1        Iteration i+2
LD   F0, 0(R1)
ADDD F4, F0, F2      LD   F0, 0(R1)
SD   0(R1), F4       ADDD F4, F0, F2      LD   F0, 0(R1)
                     SD   0(R1), F4       ADDD F4, F0, F2
                                          SD   0(R1), F4

Original loop:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Software pipelined loop:
Loop: SD    16(R1), F4
      ADDD  F4, F0, F2
      LD    F0, 0(R1)
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
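The software-pipelined loop only shows the steady state; the start-up and finish-up code are not on the slide. The sketch below is one way to write all three parts and check them against the plain loop. It is an illustration under assumptions: it walks the array upward with 0-based indexes (the slide walks downward through R1), and the function names are hypothetical.

```python
def plain_loop(x, s):
    """x[i] = x[i] + s, one load/add/store per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x

def software_pipelined(x, s):
    """Steady-state body: store of iteration i, add of i+1, load of i+2."""
    n = len(x)
    if n < 2:
        return plain_loop(x, s)
    # start-up code
    loaded = x[0]            # LD   for iteration 0
    summed = loaded + s      # ADDD for iteration 0
    loaded = x[1]            # LD   for iteration 1
    for i in range(n - 2):   # steady state
        x[i] = summed        # SD   (iteration i)
        summed = loaded + s  # ADDD (iteration i+1)
        loaded = x[i + 2]    # LD   (iteration i+2)
    # finish-up code
    x[n - 2] = summed        # SD for iteration n-2
    x[n - 1] = loaded + s    # ADDD and SD for iteration n-1
    return x
```

Inside the steady-state body the three operations belong to different iterations, so none of them waits on a result produced in the same pass through the loop.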
Trace (global-code) Scheduling
• Find ILP across conditional branches
• Two-step process
– Trace selection
• Find a trace (sequence of basic blocks)
• Use loop unrolling to generate long traces
• Use static branch prediction for other conditional
branches
– Trace compaction
• Squeeze the trace into a small number of wide
instructions
• Preserve data and control dependences
Trace Selection
Flow of the loop body:
A[I] = A[I] + B[I]
A[I] = 0?
    T: B[I] =
    F: X
C[I] =

Code:
      LW    R4, 0(R1)
      LW    R5, 0(R2)
      ADD   R4, R4, R5
      SW    0(R1), R4
      BNEZ  R4, else
      ....
      SW    0(R2), . . .
      J     join
Else: ....
      X
Join: ....
      SW    0(R3), . . .
Summary of Compiler Techniques
• Try to avoid dependence stalls
• Loop unrolling
– Reduce loop overhead
• Software pipelining
– Reduce single body dependence stalls
• Trace scheduling
– Reduce impact of other branches
• Compilers use a mix of all three
• All techniques depend on prediction
accuracy
CSE 4821 Advanced Computer Architecture

Thread Level Parallelism

Dr. M. A. Rouf
Dept. of CSE
DUET, Gazipur
Performance beyond single thread ILP
• There can be much higher natural parallelism in
some applications
(e.g., Database or Scientific codes)
• Explicit Thread Level Parallelism or Data Level
Parallelism
• Thread: process with own instructions and data
• A thread may be a process that is part of a parallel program of multiple
processes, or it may be an independent program
• Each thread has all the state (instructions, data, PC, register state, and
so on) necessary to allow it to execute
• Data Level Parallelism: Perform identical operations
on data, and lots of data

Thread Level Parallelism (TLP)
• ILP exploits implicit parallel operations within a
loop or straight-line code segment
• TLP explicitly represented by the use of multiple
threads of execution that are inherently parallel
• Goal: Use multiple instruction streams to
improve
1. Throughput of computers that run many programs
2. Execution time of multi-threaded programs
• TLP could be more cost-effective to exploit than
ILP

New Approach: Multithreaded Execution
• Multithreading: multiple threads share the functional
units of 1 processor via overlapping
• The processor must duplicate the independent state of each thread, e.g., a separate
copy of the register file, a separate PC, and, for running independent programs, a
separate page table
• Memory is shared through the virtual memory mechanisms, which already
support multiple processes
• HW for fast thread switch; much faster than a full process switch, which takes 100s to
1000s of clock cycles
• When to switch?
• Alternate instruction per thread (fine grain) in each cycle
• When a thread is stalled, perhaps for a cache miss, another thread can be
executed (coarse grain)

Fine-Grained Multithreading

• Switches between threads on each instruction, causing the execution
of multiple threads to be interleaved
• Usually done in a round-robin fashion, skipping any stalled threads
• CPU must be able to switch threads every clock
• Advantage: it can hide both short and long stalls, since
instructions from other threads are executed when one thread stalls
• Disadvantage: it slows down execution of individual threads, since a
thread ready to execute without stalls will be delayed by instructions
from other threads
• Used on Sun’s Niagara (will see later)
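The round-robin policy described above (one instruction per cycle, skipping stalled threads) can be sketched as a toy simulator. Everything here is an assumption made for illustration: a thread is a list of stall lengths, where issuing an instruction with stall k blocks that thread for the next k cycles, and `fine_grained` is a hypothetical name.

```python
def fine_grained(threads, cycles):
    """Issue one instruction per cycle, rotating over threads and skipping stalled ones."""
    names = list(threads)
    pc = {t: 0 for t in names}        # next instruction index per thread
    ready_at = {t: 0 for t in names}  # first cycle the thread may issue again
    trace = []
    start = 0                         # thread that gets first chance this cycle
    for cycle in range(cycles):
        issued = None
        for k in range(len(names)):
            t = names[(start + k) % len(names)]
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                issued = t
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                start = (start + k + 1) % len(names)
                break
        trace.append(issued)          # None marks an idle issue slot
    return trace

two_threads = fine_grained({"A": [0, 2, 0], "B": [0, 0, 0]}, 6)
one_thread = fine_grained({"A": [0, 2, 0]}, 6)
```

With two threads the 2-cycle stall of A is covered by instructions from B, so no slot is idle in the first six cycles; running A alone leaves idle slots during its stall. The same trace also shows the disadvantage from the slide: A's instructions are spread out even when A is ready.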

Coarse-Grained Multithreading
• Switches threads only on costly stalls, such as L2 cache misses
• Advantages
• Relieves need to have very fast thread-switching
• Doesn’t slow down thread, since instructions from other threads
issued only when the thread encounters a costly stall
• Disadvantage: it is hard to overcome throughput losses from shorter
stalls, due to pipeline start-up costs
• Since CPU issues instructions from 1 thread, when a stall occurs, the
pipeline must be emptied or frozen
• New thread must fill pipeline before instructions can complete
• Because of this start-up overhead, coarse-grained multithreading is
better for reducing penalty of high cost stalls, where pipeline refill <<
stall time
• Used in IBM AS/400
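The trade-off on this slide (switching pays off only when pipeline refill << stall time) can be made concrete with a very small cost model. This is a sketch under assumptions: the threshold policy, the function name `lost_cycles`, and all the numbers are hypothetical and only serve to illustrate the argument.

```python
def lost_cycles(stall_cycles, refill_cycles, switch_threshold):
    """Cycles lost to one stall under a coarse-grained switching policy.

    The processor switches threads only for stalls of at least switch_threshold
    cycles; a switch costs refill_cycles of pipeline start-up, after which the
    other thread does useful work for the rest of the stall.
    """
    if stall_cycles >= switch_threshold:
        return refill_cycles
    return stall_cycles  # short stall: no switch, the pipeline simply waits

long_stall = lost_cycles(stall_cycles=200, refill_cycles=6, switch_threshold=20)
short_stall = lost_cycles(stall_cycles=3, refill_cycles=6, switch_threshold=20)
```

In this model a 200-cycle miss costs only the 6 refill cycles, while a 3-cycle stall is lost entirely, which is the throughput loss from shorter stalls named as the disadvantage.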

For most apps, most execution units lie idle
[Figure: for an 8-way superscalar.]
From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.
Do both ILP and TLP?

• TLP and ILP exploit two different kinds of parallel
structure in a program
• Could a processor oriented toward ILP also exploit TLP?
• Functional units are often idle in a data path designed for ILP because
of either stalls or dependences in the code
• Could the TLP be used as a source of independent
instructions that might keep the processor busy
during stalls?
• Could TLP be used to employ the functional units
that would otherwise lie idle when insufficient ILP
exists?

Simultaneous Multi-threading ...
[Figure: two panels, “One thread, 8 units” and “Two threads, 8 units”, each showing cycles 1 to 9 against the eight units M, M, FX, FX, FP, FP, BR, CC]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
Simultaneous Multithreading (SMT)
• Simultaneous multithreading (SMT): insight that dynamically
scheduled processor already has many HW mechanisms to support
multithreading
• Large set of virtual registers that can be used to hold the register sets of
independent threads
• Register renaming provides unique register identifiers, so instructions from
multiple threads can be mixed in datapath without confusing sources and
destinations across threads
• Out-of-order completion allows the threads to execute out of order, and
get better utilization of the HW
• Just adding a per thread renaming table and keeping separate PCs
• Independent commitment can be supported by logically keeping a separate
reorder buffer for each thread

Source: Microprocessor Report, December 6, 1999, “Compaq Chooses SMT for Alpha”
Multithreading Categories
[Figure: issue slots over time (processor cycle) for five organizations, with columns labeled FUs 1 to 4 and Pipes 1 to 4; each slot is filled by Thread 1, Thread 2, Thread 3, Thread 4 or Thread 5, or is an idle slot]
Superscalar: 16/48 = 33.3%
Fine-Grained (new thread/cycle): 27/48 = 56.3%
Coarse-Grained (many cycles/thread): 27/48 = 56.3%
Multiprocessing (separate jobs): 29/48 = 60.4%
Simultaneous Multithreading: 42/48 = 87.5%
Design Challenges in SMT
• Since SMT makes sense only with a fine-grained implementation,
what is the impact of fine-grained scheduling on single-thread performance?
• Does a preferred-thread approach sacrifice neither throughput nor single-thread
performance?
• Unfortunately, with a preferred thread, the processor is likely to sacrifice some
throughput when the preferred thread stalls
• Larger register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
• Instruction issue - more candidate instructions need to be considered
• Instruction completion - choosing which instructions to commit may be
challenging
• Ensuring that cache and TLB conflicts generated by SMT do not
degrade performance
