
Case Study: Intel Processors
Courtesy: Intel Corp.
By: Dr. M. A. Rouf

Professor, Dept. of CSE, DUET, Gazipur
Course Introduction
• Course Teacher:
– Professor Dr. Mohammad Abdur Rouf
– Dr. Mohammad Jakirul Islam
• Course Website:
– https://sites.google.com/a/duet.ac.bd/marouf-cse/courses-2018/cse-4821-advanced-computer-archtecture
CMOS VLSI Design

Course Introduction
• Email:
– Dr. M. A. Rouf
• rouf7606@gmail.com, marouf.cse@duet.ac.bd
– Dr. Jakirul Islam
• jakirduet@gmail.com
• The Zoom classroom meeting link will be announced via the CR or another suitable forum

Course Introduction
• Attendance
– Attendance will be taken during class time
• Class material and PPT slides will be uploaded before class
• It is advisable to download and print the slides before class time
• If a class meeting is disrupted by a power outage or network failure, the issue will be resolved after discussion

Outline
• Evolution of Intel Microprocessors
– Scaling from 4004 to Pentium 4
– Courtesy of Intel Museum

4004
• First microprocessor (1971)
– For Busicom calculator of Nippon
Calculator
• Characteristics
– 10 µm process
– 2300 transistors
– 400 – 800 kHz
– 4-bit word size
– 16-pin DIP package
• Intel 4004 was a part of MCS-4 chipset,
which included the following chips:
– 4001 - 256-bit mask ROM and 4-bit I/O
device,
– 4002 - 320-bit RAM and 4-bit I/O device,
– 4003 - 10-bit shift register,
– 4008 and 4009 - standard memory and
I/O interface set.

8008
• 8-bit follow-on (1972)
– Dumb terminals
• Characteristics
– 10 µm process
– 500 – 800 kHz
– 8-bit word size
– 16 KB Physical memory

8080
• 16-bit address bus (1974)
– Used in Altair computer
• (early hobbyist PC)
• Characteristics
– 6 µm process
– 2 MHz
– 8-bit word size

8086 / 8088
• 16-bit processor (1978-9)
– IBM PC and PC XT
– Revolutionary products
– Introduced x86 ISA
• Characteristics
– 3 µm process
– 29k transistors
– 5-10 MHz
– 16-bit word size
• Microcode ROM

80286
• Virtual memory (1982)
– IBM PC AT
• Characteristics
– 1.5 µm process
– 134k transistors
– 6-12 MHz
– 68-pin PGA
• Regular datapaths and ROMs; bitslices clearly visible

80386
• 32-bit processor (1985)
– Modern x86 ISA
• Characteristics
– 1.5-1 µm process
– 275k transistors
– 16-33 MHz
– 100-pin PGA
• 32-bit datapath,
microcode ROM,
synthesized control

80486
• Pipelining (1989)
– Floating point unit
– 8 KB cache
• Characteristics
– 1-0.6 µm process
– 1.2M transistors
– 25-100 MHz
– 168-pin PGA (Pin Grid Array)
• Cache, Integer datapath,
FPU, microcode,
synthesized control

Pentium
• Superscalar (1993)
– 2 instructions per cycle
– Separate 8KB I$ & D$
• Characteristics
– 0.8-0.35 µm process
– 3.2M transistors
– 60-300 MHz
– 296-pin PGA
• Caches, datapath,
FPU, control

Pentium Pro / II / III
• Dynamic execution (1995-9)
– 3 micro-ops / cycle
– Out of order execution
– 16-32 KB I$ & D$
– Multimedia instructions
– PIII adds 256+ KB L2$
• Characteristics
– 0.6-0.18 µm process
– 5.5M-28M transistors
– 166-1000 MHz
– Multi-chip Module (MCM)
– Single Edge Contact Cartridge
(SECC)

Pentium 4
• Deep pipeline (2001)
– 20 stage pipeline
– Very fast clock
– 256-1024 KB L2$
• Characteristics
– 180 – 90 nm process
– 42-125M transistors
– 1.4-3.4 GHz
– 478-pin PGA
• Units start to become
invisible on this scale

Core i3
• Processor cores: 2
– 45nm process
– Power-optimized front side bus
– Radix-16 technology divider adds:
• Divider and square root in same chip.
– Deeper buffers
– 14 stage efficient pipeline
– Micro and Macro Ops Fusion
– Additional ALU
– Advanced Branch Prediction

Comparison of Different Cores
Feature          Core i3        Core i5        Core i7
Cores            2              4              4
Hyper-threading  Yes            No             Yes
Turbo Boost      No             Yes            Yes
K-Model          No             Yes            Yes
Cache            2-4 MB         4-6 MB         8 MB
Clock            3.4-4.2 GHz    2.4-3.8 GHz    2.9-4.2 GHz

Summary
• 10^4 increase in transistor count and clock frequency over 30 years!

LECTURE 1:
FUNDAMENTALS OF
COMPUTER DESIGN
DR. M. A. ROUF PH.D.
(KAIST)
DHAKA UNIVERSITY OF ENGINEERING AND
TECHNOLOGY (DUET)
CSE-4821: ADVANCED COMPUTER
ARCHITECTURE
SINGLE PROCESSOR PERFORMANCE
CSE-4821 Dr. M. A. Rouf, Dept. of CSE, DUET, Gazipur
CLASSES OF
COMPUTERS
Personal Mobile Device (PMD)
• e.g. smart phones, tablet computers
• Emphasis on energy efficiency and real-time for media apps
Desktop Computing
• Emphasis on price-performance
CLASSES OF
COMPUTERS (CONTD..)
Servers
• Emphasis on availability, scalability, throughput
Clusters / Warehouse Scale Computers
• Used for “Software as a Service (SaaS)”
• Emphasis on availability and price-performance
• Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks
Embedded Computers
• Microwaves, washing machines, printers, networking switches
• Emphasis: price
CURRENT TRENDS
Cannot continue to exploit Instruction-Level parallelism (ILP)
• Single processor performance improvement ended in 2003
New models for performance:

• Data-level parallelism (DLP)
• Thread-level parallelism (TLP)
• Request-level parallelism (RLP)
• These require explicit restructuring of applications
PARALLELISM
Classes of parallelism in applications:
• Data-Level Parallelism (DLP)
• Task-Level Parallelism (TLP)
Classes of architectural parallelism:
• Instruction-Level Parallelism (ILP): exploits DLP
• Vector architectures/Graphics Processor Units (GPUs): exploit DLP
• Thread-Level Parallelism: exploits DLP or TLP
• Request-Level Parallelism: exploits TLP
LAYER OF SYSTEM ARCHITECTURE
DEFINING COMPUTER
ARCHITECTURE
The task of the computer designer:
Determine what attributes are important for a new
computer, then design a computer to maximize performance
while staying within cost, power, and availability constraints
DEFINING COMPUTER
ARCHITECTURE
This task has many aspects:
• Instruction set design
• Functional organization
• Logic design
• And implementation
Also,
• Integrated circuit design
• Packaging
• Power
• Cooling
AND
• Optimization across many technologies (compiler, OS, …)
INSTRUCTION SET
ARCHITECTURE (ISA)
The instruction set architecture
serves as the boundary between
the software and hardware.
TRENDS IN
TECHNOLOGY
To evaluate a computer, a designer must
be aware of rapid changes in
implementation technology
• Integrated circuit logic:
• Transistor density increases by about 35% per year
• Die size increases by 10% to 20% per year
• The combined effect is a growth rate in transistor
count per chip of about 40% to 55% per year
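A compound annual growth rate can be turned into a doubling time with a logarithm, which is a quick way to sanity-check the rates quoted on these trend slides. The sketch below is only illustrative; the function name `doubling_time_years` is an assumption introduced here, not something from the course material.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a compound annual growth rate.

    annual_growth is a fraction, e.g. 0.40 for 40% per year.
    """
    if annual_growth <= 0:
        raise ValueError("growth rate must be positive")
    return math.log(2) / math.log(1 + annual_growth)


# Transistor-count growth rates quoted on the slide (40% and 55% per year).
for rate in (0.40, 0.55):
    years = doubling_time_years(rate)
    print(f"{rate:.0%} per year doubles in about {years:.1f} years")
```

At 40% per year the doubling time comes out at roughly two years, and at 55% per year it is noticeably shorter.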
TRENDS IN
TECHNOLOGY
• DRAM (dynamic random-access memory):
• Capacity increases by about 40% per year, doubling
roughly every two years
• Magnetic disk technology

• Before 1990: 30% per year, doubling in 3 years
• 1996~2004: from 60% to 100% increase per year
• After 2004: drop back to 30% per year
• Despite this roller coaster of rates of improvement, it is
still 50-100 times cheaper than DRAM
• Flash Memory
• LAN
CSE 4821
Advanced Computer Architecture
LECTURE 3
INSTRUCTION SET PRINCIPLES, PIPELINE HAZARDS
PROF. DR. M. A. ROUF

DEPT. OF CSE, DUET
INSTRUCTION SET DESIGN ISSUES
Instruction set design issues include:

• Where are operands stored?
• registers, memory, stack, accumulator
• How many explicit operands are there?
• 0, 1, 2, or 3
• How is the operand location specified?
• register, immediate, indirect, . . .
• What type & size of operands are supported?
• byte, int, float, double, string, vector. . .
• What operations are supported?
• add, sub, mul, move, compare . . .
EVOLUTION OF INSTRUCTION SETS
Single Accumulator (EDSAC 1950, Maurice Wilkes)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model from Implementation
• High-level Language Based (B5000 1963)
• Concept of a Family (IBM 360 1964)
General Purpose Register Machines
• Complex Instruction Sets (Vax, Intel 432 1977-80)
• Load/Store Architecture (CDC 6600, Cray 1 1963-76)
CISC: Intel x86, Pentium
RISC: (MIPS, Sparc, HP-PA, IBM RS6000, PowerPC . . . 1987)
CLASSIFYING ISAS
Accumulator (before 1960, e.g. 68HC11):
1-address   add A            acc ← acc + mem[A]
Stack (1960s to 1970s):
0-address   add              tos ← tos + next
Memory-Memory (1970s to 1980s):
2-address   add A, B         mem[A] ← mem[A] + mem[B]
3-address   add A, B, C      mem[A] ← mem[B] + mem[C]
Register-Memory (1970s to present, e.g. 80x86):
2-address   add R1, A        R1 ← R1 + mem[A]
            load R1, A       R1 ← mem[A]
Register-Register (Load/Store) (1960s to present, e.g. MIPS):
3-address   add R1, R2, R3   R1 ← R2 + R3
            load R1, R2      R1 ← mem[R2]
            store R1, R2     mem[R1] ← R2
OPERAND LOCATIONS IN FOUR ISA CLASSES
GPR
WORD-ORIENTED MEMORY
ORGANIZATION
Memory is byte addressed and provides access for bytes (8 bits), half words (16 bits), words (32 bits), and double words (64 bits).
Addresses Specify Byte Locations
• Address of first byte in word
• Addresses of successive words differ by 4 (32-bit) or 8 (64-bit)
[Figure: byte addresses 0000 to 0015; 32-bit words at Addr = 0000, 0004, 0008, 0012; 64-bit words at Addr = 0000, 0008]
BYTE ORDERING
How should bytes within multi-byte word be ordered in memory?
Conventions
• Sun’s, Mac’s are “Big Endian” machines
• Least significant byte has highest address
• Alphas, PC’s are “Little Endian” machines
• Least significant byte has lowest address
BYTE ORDERING EXAMPLE
Big Endian
• Least significant byte has highest address
Little Endian
• Least significant byte has lowest address
Example
• Variable x has 4-byte representation 0x01234567
• Address given by &x is 0x100
Address         0x100  0x101  0x102  0x103
Big Endian      01     23     45     67
Little Endian   67     45     23     01
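The byte layout in this example can be reproduced with Python's standard `struct` module, where `">I"` packs a 4-byte unsigned integer in big-endian order and `"<I"` packs it in little-endian order. This is only a small illustrative sketch; the base address 0x100 is taken from the slide, and the variable names are assumptions.

```python
import struct

x = 0x01234567
base_address = 0x100  # address of x, as in the slide example

big = struct.pack(">I", x)     # big endian: most significant byte first
little = struct.pack("<I", x)  # little endian: least significant byte first

for offset in range(4):
    address = base_address + offset
    print(f"0x{address:03x}: big={big[offset]:02x} little={little[offset]:02x}")
```

The least significant byte, 0x67, ends up at the highest address in the big-endian layout and at the lowest address in the little-endian layout.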
TYPES OF OPERATIONS
Arithmetic and Logic: AND, ADD
Data Transfer: MOVE, LOAD, STORE
Control: BRANCH, JUMP, CALL
System: OS CALL, VM
Floating Point: ADDF, MULF, DIVF
Decimal: ADDD, CONVERT
String: MOVE, COMPARE
Graphics: (DE)COMPRESS
TOP 10 80X86 INSTRUCTIONS
Rank  Instruction              Integer average (% of total executed)
1     load                     22%
2     conditional branch       20%
3     compare                  16%
4     store                    12%
5     add                      8%
6     and                      6%
7     sub                      5%
8     move register-register   4%
9     call                     1%
10    return                   1%
      Total                    96%
• Simple instructions dominate instruction frequency
RELATIVE FREQUENCY OF
CONTROL INSTRUCTIONS
Operation SPECint92 SPECfp92

Call/Return 13% 11%
Jumps 6% 4%
Branches 81% 87%
• Design hardware to handle branches quickly,

since these occur most frequently
THE MIPS INSTRUCTION FORMATS
All MIPS instructions are 32 bits long. The three instruction formats:
• R-type: op (bits 31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
          6 bits            5 bits       5 bits       5 bits       5 bits         6 bits
• I-type: op (bits 31-26) | rs (25-21) | rt (20-16) | immediate (15-0)
          6 bits            5 bits       5 bits       16 bits
• J-type: op (bits 31-26) | target address (25-0)
          6 bits            26 bits
The different fields are:
• op: operation of the instruction
• rs, rt, rd: the source and destination register specifiers
• shamt: shift amount
• funct: selects the variant of the operation in the “op” field
• address / immediate: address offset or immediate value
• target address: target address of the jump instruction
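The R-type layout can be exercised with a few shifts and masks. The sketch below packs and unpacks the six fields using the widths listed above; the helper names are assumptions, and the sample register numbers (17 for $s1, 18 for $s2, 8 for $t0) and the funct code 0x20 for `add` come from the standard MIPS conventions rather than from these slides.

```python
# Field names and widths of the R-type format, most significant field first.
R_TYPE_FIELDS = (("op", 6), ("rs", 5), ("rt", 5), ("rd", 5), ("shamt", 5), ("funct", 6))


def encode_r_type(**values):
    """Pack the R-type fields into one 32-bit word (missing fields default to 0)."""
    word = 0
    for name, width in R_TYPE_FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_r_type(word):
    """Split a 32-bit word back into its R-type fields."""
    fields = {}
    shift = 32
    for name, width in R_TYPE_FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


# add $t0, $s1, $s2 under the assumptions stated above
word = encode_r_type(op=0, rs=17, rt=18, rd=8, shamt=0, funct=0x20)
print(f"0x{word:08x}", decode_r_type(word))
```

Because every field has a fixed position, decoding needs no knowledge of the operation, which is part of what makes fixed 32-bit formats easy to decode in the ID stage.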
MIPS ADDRESSING MODES/INSTRUCTION FORMATS
• All instructions 32 bits wide
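• Register (direct): op rs rt rd; the operand is in a register
• Immediate: op rs rt immed; the operand is the immed field
• Displacement: op rs rt immed; the operand is in memory at address register + immed
• PC-relative: op rs rt immed; the target is in memory at address PC + immed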
REVIEW: 5-STAGE
EXECUTION
5 canonical stage “RISC” load-store architecture
1. Instruction fetch (IF):
• get instruction from memory/cache
2. Instruction decode, Register read (ID):
• translate opcode into control signals and read regs
3. Execute (EX):
• perform ALU operation, load/store address, branch outcomes
4. Memory (MEM):
• access memory if load/store, everyone else idle
5. Writeback/retire (WB):
• write results to register file
SOLUTION
Overlap execution of instructions
• Start a new instruction every cycle: the next instruction is fetched while the
previous one is decoded (pipelining). Each stage performs one specific task per cycle;
the number of stages is called the pipeline depth (5 here)
Non-pipelined
time 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
     IF  ID  EX  MEM WB  IF  ID  EX  MEM WB  IF  ID  EX  MEM WB
Pipelined
time 1   2   3   4   5   6   7
     IF  ID  EX  MEM WB
         IF  ID  EX  MEM WB
             IF  ID  EX  MEM WB
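The cycle counts in the two timing diagrams follow from simple formulas: without pipelining each instruction takes one cycle per stage, while an ideal pipeline needs one pass to fill and then completes one instruction per cycle. The sketch below assumes an ideal pipeline with no stalls; the function names are illustrative.

```python
def cycles_non_pipelined(n_instructions: int, depth: int = 5) -> int:
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * depth


def cycles_pipelined(n_instructions: int, depth: int = 5) -> int:
    """Ideal pipeline: depth cycles for the first instruction, then one more per instruction."""
    if n_instructions == 0:
        return 0
    return depth + (n_instructions - 1)


for n in (3, 100):
    print(n, cycles_non_pipelined(n), cycles_pipelined(n))
```

For three instructions this gives 15 cycles versus 7, matching the diagrams; for long instruction streams the ratio approaches the pipeline depth.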
Pipeline Progress: an instruction moves through the pipeline together with all its control signals, addresses, and data items, so the pipeline registers have different lengths at different stages
[Figure: pipelined datapath with PC, instruction memory, register file (R0-R7), ALU, data memory, and the pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB]
DATA HAZARD - STALLING
add $s0,$t0,$t1   IF  ID  EX  MEM WB              W: $s0 written here (WB)
STALL                 (bubble)
STALL                     (bubble)
sub $t2,$s0,$t3               IF  ID  EX  MEM WB  R: $s0 read here (ID)
DATA HAZARDS
Two different instructions use the same storage
location
• It must appear as if they executed in sequential order
read-after-write (RAW)    write-after-read (WAR)    write-after-write (WAW)
add R1, R2, R3            add R1, R2, R3            add R1, R2, R3
sub R2, R4, R1            sub R2, R4, R1            sub R2, R4, R1
or  R1, R6, R3            or  R1, R6, R3            or  R1, R6, R3
True dependence (real)    anti dependence           output dependence
                          (artificial)              (artificial)
• RAW: sub reads R1, which add writes
• WAR: sub writes R2, which add reads
• WAW: or writes R1, which add also writes
Where (How) do WAR and WAW hazards occur ?
CONTROL HAZARD ON BRANCHES
THREE STAGE STALL
10: beq r1,r3,36    Ifetch  Reg     ALU     DMem    Reg
14: and r2,r3,r5            Ifetch  Reg     ALU     DMem    Reg
18: or  r6,r1,r7                    Ifetch  Reg     ALU     DMem    Reg
22: add r8,r1,r9                            Ifetch  Reg     ALU     DMem    Reg
36: xor r10,r1,r11                                  Ifetch  Reg     ALU     DMem    Reg
The penalty when the branch is taken is 3 cycles!
CONTROL HAZARDS
Branch problem:
• branches are resolved in EX stage
→ 3 cycles penalty on taken branches
Ideal CPI = 1. Assuming a 3-cycle penalty for all branches and 32% branch
instructions → new CPI = 1 + 0.32*3 = 1.96
Solutions:
• Reduce branch penalty: change the datapath – new adder needed
in ID stage.
• Fill branch delay slot(s) with a useful instruction.
• Fixed branch prediction.
• Static branch prediction.
• Dynamic branch prediction.
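The CPI arithmetic in the branch problem above can be written as a one-line model, which makes it easy to see how much each of the listed solutions could help. This is a minimal sketch; the function name and the alternative penalty of 1 cycle are assumptions for illustration.

```python
def effective_cpi(ideal_cpi: float, branch_fraction: float, branch_penalty_cycles: float) -> float:
    """CPI when every branch costs a fixed number of extra cycles."""
    return ideal_cpi + branch_fraction * branch_penalty_cycles


# Numbers from the slide: ideal CPI 1, 32% branches, 3-cycle penalty.
print(round(effective_cpi(1.0, 0.32, 3), 2))

# Hypothetical improvement: resolving branches earlier so the penalty is 1 cycle.
print(round(effective_cpi(1.0, 0.32, 1), 2))
```

The model treats every branch as paying the full penalty, which is the same simplification the slide makes.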
Pipeline: Hazards
Dr. M. A. Rouf
Professor
Dept. of CSE
Pipelining Outline
• Introduction
– Defining Pipelining
– Pipelining Instructions
• Hazards
– Structural hazards
– Data Hazards
– Control Hazards
• Performance
• Controller implementation
Pipeline Hazards
• Where one instruction cannot immediately follow
another
• Types of hazards
– Structural hazards - attempt to use the same resource
by two or more instructions
– Control hazards - attempt to make branching decisions
before branch condition is evaluated
– Data hazards - attempt to use data before it is ready
• Can always resolve hazards by waiting

Structural Hazards
• Attempt to use the same resource by two or more
instructions at the same time
• Example: Single Memory for instructions and data
–Accessed by IF stage
–Accessed at same time by MEM stage
• Solutions
–Delay the second access by one clock cycle, OR
–Provide separate memories for instructions & data
• This is what the book does
• This is called a “Harvard Architecture”
• Real pipelined processors have separate caches

Pipelined Example -
Executing Multiple Instructions
• Consider the following instruction
sequence:
lw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10

Stage occupancy in each clock cycle:
Clock Cycle   IF    ID    EX    MEM   WB
1             LW
2             SW    LW
3             ADD   SW    LW
4             SUB   ADD   SW    LW
5                   SUB   ADD   SW    LW
6                         SUB   ADD   SW
7                               SUB   ADD
8                                     SUB

Alternative View - Multicycle Diagram
                      CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw  $r0, 10($r1)      IM    REG   ALU   DM    REG
sw  $r3, 20($r4)            IM    REG   ALU   DM    REG
add $r5, $r6, $r7                 IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                      IM    REG   ALU   DM    REG

Alternative View - Multicycle Diagram
                      CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw  $r0, 10($r1)      IM    REG   ALU   DM    REG
sw  $r3, 20($r4)            IM    REG   ALU   DM    REG
add $r5, $r6, $r7                 IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                      IM    REG   ALU   DM    REG
Memory Conflict: in CC 4, lw accesses data memory (DM) while sub is fetched from instruction memory (IM)
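With a single memory shared by instruction fetch and data access, the conflict in the diagram can be found mechanically: instruction i is fetched in cycle i + 1 and reaches its memory stage three cycles later. The sketch below assumes the 5-stage pipeline shown here, no stalls, and that only `lw` and `sw` use data memory; the function name is hypothetical.

```python
def memory_conflicts(program, memory_ops=("lw", "sw")):
    """Return (cycle, memory_instruction, fetched_instruction) for each single-port conflict.

    program is a list of opcode names in program order; instruction i (0-based)
    is fetched in cycle i + 1 and is in its memory stage in cycle i + 4.
    """
    conflicts = []
    for i, opcode in enumerate(program):
        if opcode not in memory_ops:
            continue
        mem_cycle = i + 4
        fetched_index = mem_cycle - 1  # instruction being fetched in that cycle
        if fetched_index < len(program):
            conflicts.append((mem_cycle, opcode, program[fetched_index]))
    return conflicts


print(memory_conflicts(["lw", "sw", "add", "sub"]))
```

For the four-instruction example only `lw` conflicts (with the fetch of `sub` in cycle 4); `sw` reaches its memory stage in cycle 5, when nothing is left to fetch.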

One Memory Port Structural Hazards
Time (clock cycles), rows in instruction order
          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
Load      Ifetch   Reg      ALU      DMem     Reg
Instr 1            Ifetch   Reg      ALU      DMem     Reg
Instr 2                     Ifetch   Reg      ALU      DMem     Reg
Stall                                Bubble   Bubble   Bubble   Bubble   Bubble
Instr 3                                       Ifetch   Reg      ALU      DMem     Reg

Structural Hazards
Some common Structural Hazards:
• Memory:
– we’ve already mentioned this one.
• Floating point:
– Since many floating point instructions require many cycles, it’s easy for them
to interfere with each other.
• Starting up more of one type of instruction than there are
resources.
– For instance, the PA-8600 can support two ALU + two load/store instructions
per cycle - that’s how much hardware it has available.

Structural Hazards
Dealing with Structural Hazards
Stall
• low cost, simple
• Increases CPI
• use for rare case since stalling has performance effect
Pipeline hardware resource
• useful for multi-cycle resources
• good performance
• sometimes complex e.g., RAM
Replicate resource
• good performance
• increases cost (+ maybe interconnect delay)
• useful for cheap or divisible resources

Structural Hazards
• Structural hazards are reduced with these rules:

– Each instruction uses a resource at most once
– Always use the resource in the same pipeline stage
– Use the resource for one cycle only
• Many RISC ISAs are designed with this in mind
• Sometimes very difficult to do this.
– For example, memory of necessity is used in the IF and
MEM stages.

Pipelining Outline
• Introduction
• Hazards
– Structural hazards
– Data Hazards
– Control Hazards
• Performance
Data Hazards
• Data hazards occur when data is used before it is ready
Time (in clock cycles)
                        CC 1    CC 2    CC 3    CC 4    CC 5    CC 6    CC 7    CC 8    CC 9
Value of register $2:   10      10      10      10      10/–20  –20     –20     –20     –20
Program execution order (in instructions)
sub $2, $1, $3          IM      Reg     ALU     DM      Reg
and $12, $2, $5                 IM      Reg     ALU     DM      Reg
or  $13, $6, $2                         IM      Reg     ALU     DM      Reg
add $14, $2, $2                                 IM      Reg     ALU     DM      Reg
sw  $15, 100($2)                                        IM      Reg     ALU     DM      Reg
The use of the result of the SUB instruction in the next three instructions causes a data
hazard, since the register $2 is not written until after those instructions read it.

Data Hazards
Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
Execution Order is: InstrI, then InstrJ
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Data Hazards
Write After Read (WAR)
InstrJ tries to write operand before InstrI reads it
– Gets wrong operand
Execution Order is: InstrI, then InstrJ
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
– Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:

– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5

Data Hazards
Write After Write (WAW)
InstrJ tries to write operand before InstrI writes it
– Leaves wrong result (InstrI, not InstrJ)
Execution Order is: InstrI, then InstrJ
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers.
This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:

– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW later in more complicated pipes
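The three hazard classes can be told apart purely from the destination and source registers of two instructions in program order. The sketch below is an illustrative classifier, not a hardware description; the function name and the `(dest, sources)` tuple representation are assumptions.

```python
def classify_hazards(first, second):
    """Classify register hazards between two instructions in program order.

    Each instruction is a (dest, sources) pair, e.g. ("r1", ("r2", "r3")).
    """
    first_dest, first_sources = first
    second_dest, second_sources = second
    hazards = set()
    if first_dest in second_sources:
        hazards.add("RAW")  # second reads what first writes
    if second_dest in first_sources:
        hazards.add("WAR")  # second writes what first reads
    if first_dest == second_dest:
        hazards.add("WAW")  # both write the same register
    return hazards


# I/J pairs taken from the three Data Hazards slides
print(classify_hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r3"))))  # add then sub
print(classify_hazards(("r4", ("r1", "r3")), ("r1", ("r2", "r3"))))  # sub then add
print(classify_hazards(("r1", ("r4", "r3")), ("r1", ("r2", "r3"))))  # sub then add
```

Whether a detected dependence actually causes a stall depends on the pipeline: as noted above, WAR and WAW cannot cause problems in the MIPS 5-stage pipeline.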

Data Hazard Detection in MIPS (1)
Read after Write
Time (in clock cycles)
Value of register $2: 10, 10, 10, 10, 10/–20, –20, –20, –20, –20
Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB
EX hazard:
1a: EX/MEM.RegisterRd = ID/EX.RegisterRs
1b: EX/MEM.RegisterRd = ID/EX.RegisterRt
MEM hazard:
2a: MEM/WB.RegisterRd = ID/EX.RegisterRs
2b: MEM/WB.RegisterRd = ID/EX.RegisterRt
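The four detection conditions translate directly into a small selection function for each ALU input. The sketch below is one possible reading, with two additions that are assumptions beyond the conditions listed here: the producing instruction must actually write a register (a RegWrite flag), and register 0 is never forwarded because it is hard-wired to zero in MIPS. Class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class PipelineState:
    id_ex_rs: int            # ID/EX.RegisterRs
    id_ex_rt: int            # ID/EX.RegisterRt
    ex_mem_rd: int           # EX/MEM.RegisterRd
    ex_mem_reg_write: bool   # assumed RegWrite flag carried in EX/MEM
    mem_wb_rd: int           # MEM/WB.RegisterRd
    mem_wb_reg_write: bool   # assumed RegWrite flag carried in MEM/WB


def forwarding_sources(state: PipelineState):
    """Return where ALU inputs A (rs) and B (rt) should come from."""

    def pick(source_register: int) -> str:
        # EX hazard (1a/1b): the newest value sits in EX/MEM, so check it first.
        if state.ex_mem_reg_write and state.ex_mem_rd != 0 and state.ex_mem_rd == source_register:
            return "EX/MEM"
        # MEM hazard (2a/2b).
        if state.mem_wb_reg_write and state.mem_wb_rd != 0 and state.mem_wb_rd == source_register:
            return "MEM/WB"
        return "REGFILE"

    return pick(state.id_ex_rs), pick(state.id_ex_rt)


# "and $12, $2, $5" in EX while "sub $2, $1, $3" is in MEM
print(forwarding_sources(PipelineState(2, 5, 2, True, 0, False)))
# "or $13, $6, $2" in EX, "and" (rd = 12) in MEM, "sub" (rd = 2) in WB
print(forwarding_sources(PipelineState(6, 2, 12, True, 2, True)))
```

Checking EX/MEM before MEM/WB matters when both stages hold a write to the same register: the younger result is the one the ALU should see.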

Data Hazards
• Solutions for Data Hazards
– Stalling
– Forwarding:
• connect new value directly to next stage
– Reordering

Data Hazard - Stalling
add $s0,$t0,$t1   IF  ID  EX  MEM WB              W: $s0 written here (WB)
STALL                 (bubble)
STALL                     (bubble)
sub $t2,$s0,$t3               IF  ID  EX  MEM WB  R: $s0 read here (ID)

Data Hazards - Stalling
Simple Solution to RAW
• Hardware detects RAW and stalls

• Assumes register written then read each cycle
+ low cost to implement, simple
-- reduces IPC
• Try to minimize stalls
Minimizing RAW stalls
• Bypass/forward/shortcircuit (We will use the word “forward”)

• Use data before it is in the register
+ reduces/avoids stalls
-- complex
• Crucial for common RAW hazards

Data Hazards - Forwarding
Key idea: connect new value directly to next stage
• Still read s0, but ignore in favor of new result
• Problem: what about load instructions?

Data Hazards - Forwarding
• STALL still required for load - data avail. after MEM
• MIPS architecture calls this delayed load, initial
implementations required compiler to deal with this
lw $s0,20($t1)    IF  ID  EX  MEM WB          W: new value of s0 (available after MEM)
STALL                 (bubble)
sub $t2,$s0,$t3           IF  ID  EX  MEM WB  R: s0 read in ID

Forwarding
Key idea: connect data internally before it's stored
[Figure: value of register $2 per clock cycle (10, 10, 10, 10, 10/–20, –20, –20, –20, –20) above the instructions in program order]
How would you design the forwarding?

Data Hazard Solution: Forwarding
• Key idea: connect data internally before it's stored
Time (in clock cycles)
                        CC 1    CC 2    CC 3    CC 4    CC 5    CC 6    CC 7    CC 8    CC 9
Value of register $2:   10      10      10      10      10/–20  –20     –20     –20     –20
Value of EX/MEM:        X       X       X       –20     X       X       X       X       X
Value of MEM/WB:        X       X       X       X       –20     X       X       X       X
Program execution order (in instructions)
Assumption:
• The register file forwards values that are read and written during the same cycle.
Data Hazard Summary
• Three types of data hazards
– RAW (MIPS)
– WAW (not in MIPS)
– WAR (not in MIPS)
• Solution to RAW in MIPS
– Stall
– Forwarding
– Reordering
• Detection & Control
– EX hazard
– MEM hazard
• A stall is needed when an instruction reads a register right after a load
instruction that writes the same register.
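The load-use case in the summary can be detected one stage earlier than forwarding: while the load is in EX, the following instruction is in ID and its source registers are already known. The check below follows the usual textbook form of this test and is a sketch under that assumption; the signal names (`id_ex_mem_read`, `if_id_rs`, and so on) are not defined in these slides.

```python
def load_use_stall(id_ex_mem_read: bool, id_ex_rt: str, if_id_rs: str, if_id_rt: str) -> bool:
    """True when the instruction in ID must wait one cycle for a load in EX.

    id_ex_mem_read: the instruction in EX is a load
    id_ex_rt:       destination register of that load
    if_id_rs/rt:    source registers of the instruction in ID
    """
    return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)


# lw $s0,20($t1) followed by sub $t2,$s0,$t3: one stall is needed
print(load_use_stall(True, "$s0", "$s0", "$t3"))

# add $s0,$t0,$t1 followed by sub $t2,$s0,$t3: forwarding alone is enough
print(load_use_stall(False, "$s0", "$s0", "$t3"))
```

After the one-cycle stall, the loaded value can be forwarded from MEM/WB like any other MEM hazard.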
Pipelining Outline
• Introduction
• Hazards
– Structural hazards
– Data Hazards
– Control Hazards
• Performance
Data Hazard Review
• Three types of data hazards
– RAW (in MIPS and all others)
– WAW (not in MIPS but many others)
– WAR (not in MIPS but many others)
• Forwarding

Data Hazard Detection in MIPS
Read after Write
Value of register $2 in CC 1 to CC 9: 10, 10, 10, 10, 10/–20, –20, –20, –20, –20
EX hazard:
1a: EX/MEM.RegisterRd = ID/EX.RegisterRs
1b: EX/MEM.RegisterRd = ID/EX.RegisterRt
MEM hazard:
2a: MEM/WB.RegisterRd = ID/EX.RegisterRs
2b: MEM/WB.RegisterRd = ID/EX.RegisterRt
Problem?
Some instructions do not write a register.
EX/MEM.RegWrite must be asserted!

Control Hazards
A control hazard is when we need to find
the destination of a branch, and can’t
fetch any new instructions until we
know that destination.
A branch is either
– Taken: PC <= PC + 4 + Immediate
– Not Taken: PC <= PC + 4
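The two next-PC cases can be written as a small function. The sketch assumes that the immediate counts instructions (4-byte words), which is how the `beq` at address 40 with immediate 7 reaches address 72 in the Branch Hazards example later in this deck; the function name and the `word_size` parameter are illustrative.

```python
def next_pc(pc: int, taken: bool, immediate: int, word_size: int = 4) -> int:
    """Next program counter after a branch at address pc."""
    fall_through = pc + word_size
    if not taken:
        return fall_through
    # Assumption: the immediate is a word offset relative to PC + 4.
    return fall_through + immediate * word_size


print(next_pc(40, taken=True, immediate=7))
print(next_pc(40, taken=False, immediate=7))
```

The control hazard arises because neither `taken` nor the target is known when the following instruction has to be fetched.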

Control Hazard on Branches
Three Stage Stall
10: beq r1,r3,36    Ifetch  Reg     ALU     DMem    Reg
14: and r2,r3,r5            Ifetch  Reg     ALU     DMem    Reg
18: or  r6,r1,r7                    Ifetch  Reg     ALU     DMem    Reg
22: add r8,r1,r9                            Ifetch  Reg     ALU     DMem    Reg
36: xor r10,r1,r11                                  Ifetch  Reg     ALU     DMem    Reg
The penalty when the branch is taken is 3 cycles!

Branch Hazards
• Just stalling for each branch is not
practical
• Common assumption: branch not taken
• When assumption fails: flush three
instructions
Time (in clock cycles)
                        CC 1    CC 2    CC 3    CC 4    CC 5    CC 6    CC 7    CC 8    CC 9
Program execution order (in instructions)
40 beq $1, $3, 7        IM      Reg     ALU     DM      Reg
44 and $12, $2, $5              IM      Reg     ALU     DM      Reg
48 or  $13, $6, $2                      IM      Reg     ALU     DM      Reg
52 add $14, $2, $2                              IM      Reg     ALU     DM      Reg
72 lw  $4, 50($7)                                       IM      Reg     ALU     DM      Reg
Basic Pipelined Processor
In our original Design, branches have a penalty of 3 cycles

Control Hazard Solutions
• Stall
– stop loading instructions until result is available
• Predict
– assume an outcome and continue fetching (undo
if prediction is wrong)
– lose cycles only on mis-prediction
• Delayed branch
– specify in architecture that the instruction
immediately following branch is always executed
Static Branch Prediction
For every branch encountered during execution predict
whether the branch will be taken or not taken.
Predicting branch not taken:

1. Speculatively fetch and execute in-line instructions following the branch
2. If prediction incorrect flush pipeline of speculated instructions
• Convert these instructions to NOPs by clearing pipeline registers
• These have not updated memory or registers at time of flush
Predicting branch taken:

1. Speculatively fetch and execute instructions at the branch target address
2. Useful only if target address known earlier than branch outcome
• May require stall cycles till target address known
• Flush pipeline if prediction is incorrect
• Must ensure that flushed instructions do not update memory/registers

Control Hazard - Stall
add $r4,$r5,$r6   IF  ID  EX  MEM WB
beq $r0,$r1,tgt       IF  ID  EX  MEM WB
STALL                     (bubble)
sw $s4,200($t5)               IF  ID  EX  MEM WB
Annotations: "beq writes PC here" marks the beq row; "new PC used here" marks the IF of sw
Control Hazard - Correct Prediction
0 2 4 6 8 10 12 16 18
tgt:
sw $s4,200($t5) IF ID EX MEM WB
Fetch assuming
branch taken

Control Hazard - Incorrect Prediction
0 2 4 6 8 10 12 16 18
tgt:
sw $s4,200($t5) IF
(incorrect - STALL) BUBBLE BUBBLE BUBBLE BUBBLE
or $r8,$r8,$r9 IF ID EX MEM WB
“Squashed”
instruction
INSTRUCTION
LEVEL
PARALLELISM
LECTURE 3
1. SCOREBOARD AND TOMASULO

ALGORITHMS
ILP CHALLENGES
• CPI of pipeline = Ideal CPI + Structural stalls + Data hazard
stalls + Control hazard stalls
• Instruction-level parallelism can be increased inside a
basic block
• A basic block is a straight-line code sequence with no
branches in except at the entry and no branches out except at the exit.
• The dynamic branch frequency is usually 15%-25%
• So there is a branch roughly every four to seven instructions
• The ILP available within a single basic block is therefore limited
• We need to overlap execution across multiple basic blocks
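Both numeric claims on this slide are simple arithmetic: the pipeline CPI is a sum of the ideal CPI and the stall contributions, and a branch frequency f implies about 1/f instructions per branch. The sketch below uses hypothetical stall values purely for illustration; only the 15% and 25% branch frequencies come from the slide.

```python
def pipeline_cpi(ideal_cpi: float, structural: float, data: float, control: float) -> float:
    """Pipeline CPI as the ideal CPI plus average stall cycles per instruction."""
    return ideal_cpi + structural + data + control


def instructions_per_branch(branch_frequency: float) -> float:
    """Average number of instructions per branch at a given dynamic branch frequency."""
    return 1.0 / branch_frequency


# Hypothetical stall contributions (not from the slides).
print(round(pipeline_cpi(1.0, structural=0.05, data=0.20, control=0.30), 2))

# Branch frequencies quoted on the slide.
for frequency in (0.15, 0.25):
    print(f"{frequency:.0%}: one branch every {instructions_per_branch(frequency):.1f} instructions")
```

With so few instructions between branches, a scheduler confined to one basic block has little independent work to overlap.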
ILP CHALLENGES (CONTROL FLOW
GRAPH)
DEFINITION OF ILP
ILP=Potential overlap of execution among instructions.
Overlapping possible if:
• No Structural Hazards
• No RAW, WAR or WAW Stalls
• No Control Stalls
HARDWARE SCHEMES
TO EXPLOIT ILP
Why?
• Works when real dependences cannot be known at compile time
• Simpler compiler
• Code for one machine runs well on another
KEY IDEA:
• Allow instructions behind stall to
proceed
• Enables out-of-order execution and
completion (commit).
• First implemented in CDC 6600
(1963).
EXAMPLE:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
• ADDD must stall for F0 (waiting for DIVD to complete).
• DIVD is a floating-point operation and
requires several cycles to complete
• SUBD does not depend on DIVD or ADDD, but it
would still stall behind ADDD without dynamic scheduling.
DEPENDENCE
• Data dependence: true dependence (RAW)
• Name dependence: WAR or WAW
• Control dependence
• Data dependence: instruction j is data dependent on instruction i if either
• Instruction i produces a result that may be used by
instruction j, or
• Instruction j is data dependent on instruction k and k is
data dependent on instruction i (transitive dependence)
DEPENDENCE
Data Dependence
NAME DEPENDENCE
• Name dependence:
• It occurs when two or more instructions use the
same register or memory location, called a name
but there is no flow of data between instructions
associated with the name.
• Two types of name dependence
• Anti dependence: Write After Read (WAR)
• Output dependence: Write After Write (WAW)
SCOREBOARD SCHEME
• The ID stage is split into two parts:
• Issue (decode and check for structural
hazards).
• Read Operands (wait until there are no data
hazards).
• The scoreboard allows instructions without
dependencies to execute.
SCOREBOARD IMPLICATIONS
• Out-of-order completion can cause WAR and
WAW hazards.
• Solutions for WAR:
• Queue both the operation and copies of its
operands.
• Read registers only during the Read Operands stage.
• For WAW, the machine stalls until the
other instruction completes
• Multiple execution units
• Scoreboard keeps track of dependencies
and state of operations.
FOUR STAGES OF
SCOREBOARD CONTROL
1. Issue
• Decode instructions & check for structural hazards.
• If a functional unit for the instruction is free and no other
active instruction has the same destination register
(WAW), the scoreboard issues the instruction to the
functional unit and updates its internal data structure.
• If a structural or a WAW hazard exists, then the instruction
issue stalls, and no further instructions will issue until these
hazards are cleared.
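The issue rule above needs only two pieces of scoreboard state: which functional units are busy and which unit, if any, is going to write each register. The sketch below models just that rule; the function names and the dictionary representation are assumptions, and the later scoreboard stages are deliberately left out.

```python
def can_issue(fu_busy: dict, register_result: dict, fu: str, dest: str) -> bool:
    """Issue check: the functional unit is free and no active instruction writes dest."""
    structural_hazard = fu_busy.get(fu, False)
    waw_hazard = register_result.get(dest) is not None
    return not structural_hazard and not waw_hazard


def issue(fu_busy: dict, register_result: dict, fu: str, dest: str) -> bool:
    """Issue the instruction if allowed and update the scoreboard state."""
    if not can_issue(fu_busy, register_result, fu, dest):
        return False  # issue stalls; no later instruction may issue either
    fu_busy[fu] = True
    register_result[dest] = fu
    return True


fu_busy, register_result = {}, {}
print(issue(fu_busy, register_result, "Integer", "F6"))      # first LD issues
print(can_issue(fu_busy, register_result, "Integer", "F2"))  # second LD: Integer unit busy
print(can_issue(fu_busy, register_result, "Add", "F6"))      # WAW on F6
```

Because issue is in order, one stalled instruction blocks everything behind it, which is why the second load in the worked example waits until the Integer unit is free.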
FOUR STAGES OF
SCOREBOARD CONTROL
2. Read Operands
• Wait until no data hazards, then read operands
• A source operand is available if:
- no earlier issued active instruction is going to write it, or
- the register holding it is being written by a currently active functional unit
• When the source operands are available, the scoreboard
tells the functional unit to proceed to read the operands
from the registers and begin execution.
• RAW hazards are resolved dynamically in this step, and
instructions may be sent into execution out of order.
FOUR STAGES OF
SCOREBOARD CONTROL
3. Execution
• Operate on operands
• The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the
scoreboard that it has completed execution.
FUs are characterized by:

- latency (the effective time used to complete one
operation).
- Initiation interval (the number of cycles that must
elapse between issuing two operations to the same
functional unit).
FOUR STAGES OF
SCOREBOARD CONTROL
4. Write result
• Finish execution
• Once the scoreboard is aware that the
unit has completed execution, the
scoreboard checks for WAR hazards.
• If none, it writes results.
• If WAR, then it stalls the instruction.
WAR EXAMPLE
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
In this case, the scoreboard would stall SUBD
in the write-result stage until ADDD has read F0 and F8.
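The WAR check at write-result time can be phrased as: do not overwrite a register that an earlier, still-active instruction has yet to read. The sketch below captures only that condition; the function name and the mapping from instruction names to unread source registers are assumptions for illustration.

```python
def must_stall_write(dest: str, unread_sources: dict) -> bool:
    """True if writing dest now would destroy an operand an earlier instruction still needs.

    unread_sources maps each earlier active instruction to the set of source
    registers it has not read yet.
    """
    return any(dest in registers for registers in unread_sources.values())


# SUBD F8,F8,F14 finishes while ADDD F10,F0,F8 is still waiting for F0 from DIVD.
print(must_stall_write("F8", {"ADDD": {"F0", "F8"}}))

# Once ADDD has read its operands, SUBD may write F8.
print(must_stall_write("F8", {"ADDD": set()}))
```

ADDD reads both operands together in the Read Operands stage, so F8 stays unread for as long as F0 is not available.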
SCOREBOARD STRUCTURE
1. Instruction status
2. Functional Unit status
Indicates the state of the functional unit (FU):
Busy – Indicates whether the unit is busy or not
Op - The operation to perform in the unit (+,-, etc.)
Fi - Destination register
Fj, Fk – Source register numbers
Qj, Qk – Functional units producing source registers
Rj, Rk – Flags indicating when Fj, Fk are ready
3. Register result status.
Indicates which functional unit will write each register.
Blank if no pending instructions will write that register.
SCOREBOARD
EXAMPLE
Instruction status
Instruction  dest  j    k    Issue  Read operands  Execution complete  Write Result
LD           F6    34+  R2
LD           F2    45+  R3
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2
Functional unit status
Time  Name     Busy  Op  Fi (dest)  Fj (S1)  Fk (S2)  Qj (FU for j)  Qk (FU for k)  Rj (Fj?)  Rk (Fk?)
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No
Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
       FU
SCOREBOARD
EXAMPLE CYCLE 1
Instruction status (Issue / Read operands / Execution complete / Write Result)
LD    F6  34+ R2   1
LD    F2  45+ R3
MULTD F0  F2  F4
SUBD  F8  F6  F2
DIVD  F10 F0  F6
ADDD  F6  F8  F2
Functional unit status
Integer: Busy=Yes, Op=Load, Fi=F6, Fk=R2, Rk=Yes
Mult1: Busy=No
Mult2: Busy=No
Add: Busy=No
Divide: Busy=No
Register result status (clock 1): F6=Integer
SCOREBOARD
EXAMPLE CYCLE 2
Instruction status (Issue / Read operands / Execution complete / Write Result)
LD    F6  34+ R2   1  2
LD    F2  45+ R3
MULTD F0  F2  F4
SUBD  F8  F6  F2
DIVD  F10 F0  F6
ADDD  F6  F8  F2
Functional unit status
Mult1: Busy=No
Mult2: Busy=No
Add: Busy=No
Divide: Busy=No
Register result status (clock 2): F6=Integer
• Integer pipeline full: cannot execute the 2nd load, so issue stalls
SCOREBOARD
EXAMPLE CYCLE 3
W rite
LD F6 34+ R2 1 2 3
LD F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult1 No
Mult2 No
Add No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer
• Issue stalls
SCOREBOARD
EXAMPLE CYCLE 4
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult1 No
Mult2 No
Add No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Integer
• Issue stalls
SCOREBOARD
EXAMPLE CYCLE 5
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult1 No
Mult2 No
Add No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer
• In this cycle the 2nd load is issued.
SCOREBOARD
EXAMPLE CYCLE 6
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6
MULTD F0 F2 F4 6
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 Integer
• MULTD is issued but has to wait for F2
SCOREBOARD
EXAMPLE CYCLE 7
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 Integer Add
• SUBD is issued, but must wait for operand F2 before it can read its operands.
SCOREBOARD
EXAMPLE CYCLE 8A

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Mult2 No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Integer Add Divide
• DIVD is issued, but there is another RAW hazard (on F0)
SCOREBOARD
EXAMPLE CYCLE 8B

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Integer No
Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Sub F8 F6 F2 Yes Yes
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Add Divide
• Load completes, and operands for MULTD and SUBD are ready
SCOREBOARD
EXAMPLE CYCLE 9

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Integer No
10 Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
2 Add Yes Sub F8 F6 F2 Yes Yes
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• MULTD and SUBD read their operands and are sent for execution in parallel
SCOREBOARD
EXAMPLE CYCLE 11
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• SUBD finishes execution
SCOREBOARD
EXAMPLE CYCLE 12

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Integer No
Mult2 No
Add No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 Divide
• Read operands for DIVD? No: it cannot read F0 before MULTD writes F0
SCOREBOARD
EXAMPLE CYCLE 13

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13
Integer No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• SUBD has written its result, so ADDD can be issued
SCOREBOARD
EXAMPLE CYCLE 14
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status
Time  Name  Busy  Op  Fi (dest)  Fj (S1)  Fk (S2)  Qj (FU for j)  Qk (FU for k)  Rj (Fj ready?)  Rk (Fk ready?)
Integer No
Mult2 No
2 Add Yes Add F6 F8 F2 Yes Yes
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
SCOREBOARD
EXAMPLE CYCLE 15
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
SCOREBOARD
EXAMPLE CYCLE 16
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
SCOREBOARD
EXAMPLE CYCLE 17

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• Write result of ADDD? No: there is a WAR hazard on F6 (DIVD has not read it yet)
SCOREBOARD
EXAMPLE CYCLE 18
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• Stall, nothing to do
SCOREBOARD
EXAMPLE CYCLE 19
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• MULTD completes execution
SCOREBOARD
EXAMPLE CYCLE 20
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Integer No
Mult1 No
Mult2 No
Divide Yes Div F10 F0 F6 Yes Yes
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Add Divide
• MULTD writes back its result
SCOREBOARD
EXAMPLE CYCLE 21
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16
Integer No
Mult1 No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide
• DIVD reads its operands, now that F0 is available
SCOREBOARD
EXAMPLE CYCLE 22
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16 22
Integer No
Mult1 No
Mult2 No
Add No
40 Divide Yes Div F10 F0 F6 Yes Yes
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU Divide
• DIVD has read its operands, so ADDD can now write its result
SCOREBOARD
EXAMPLE CYCLE 61
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61
ADDD F6 F8 F2 13 14 16 22
Integer No
Mult1 No
Mult2 No
Add No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide
• DIVD finishes execution
SCOREBOARD
EXAMPLE CYCLE 62
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61 62
ADDD F6 F8 F2 13 14 16 22
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU
CDC 6600 SCOREBOARD
Achieves a speedup of 2.5 with respect to no dynamic
scheduling
By reorganizing instructions, the compiler alone
achieves only 1.7
Limitations:
• No cache
• No forwarding hardware
• Limited to instructions in a basic block
• Small number of functional units (structural hazards)
• Wait for WAR hazards
• Prevent WAW hazards
BRANCH PREDICTION
The current DLX pipeline wastes one cycle per branch, but other
architectures resolve branches several cycles after the IF stage.
We need to predict the branch outcome as early as possible (ID
stage).
Performance of branch prediction depends on:
• Accuracy, measured as the percentage of mispredictions
• Cost of a misprediction, measured as the time wasted
executing useless instructions.
BRANCH HISTORY TABLE
Table of 1 bit values
Indexed by the lower bits of the PC address
Says whether or not branch taken last time
BRANCH HISTORY TABLE
Problem: in a loop, a 1-bit BHT causes two mispredictions:
1. On the last iteration, when we must exit the loop,
the BHT predicts that we stay in the loop.
2. When we re-enter the loop and reach the end of its first iteration,
we must stay in the loop, but the BHT predicts that we exit.
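The two mispredictions can be counted with a small simulation. This is a minimal sketch, assuming a single BHT entry and a hypothetical loop whose closing branch is taken 9 times and then falls through; the function name `simulate_one_bit` is invented for illustration.

```python
def simulate_one_bit(outcomes, initial=True):
    """Count mispredictions of one 1-bit BHT entry.

    outcomes: sequence of booleans, True meaning the branch was taken.
    The entry simply remembers the most recent outcome.
    """
    prediction = initial
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # 1-bit scheme: always store the latest outcome
    return misses


# Hypothetical loop: closing branch taken 9 times, then not taken (exit).
one_pass = [True] * 9 + [False]
three_passes = one_pass * 3

print(simulate_one_bit(one_pass))      # first execution of the loop
print(simulate_one_bit(three_passes))  # loop executed three times in a row
```

The first execution of the loop pays only for the exit; every later execution pays for both the re-entry and the exit.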
DYNAMIC BRANCH PREDICTION
It is a 2-bit scheme in which the prediction changes only after
two consecutive mispredictions.
For each index of the table, the 2 bits encode the state of a
state machine (next slide).
When we reach the end of a loop, a single misprediction does not
change the prediction.
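One common way to realize such a 2-bit scheme is a saturating counter. The sketch below assumes that realization (the state machine on the slide may use slightly different transitions) and reuses the same hypothetical loop pattern of 9 taken branches followed by one exit; `simulate_two_bit` is an illustrative name.

```python
def simulate_two_bit(outcomes, state=3):
    """Count mispredictions of one 2-bit saturating counter.

    States 0 and 1 predict not taken, states 2 and 3 predict taken.
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


one_pass = [True] * 9 + [False]
three_passes = one_pass * 3

print(simulate_two_bit(three_passes))
```

With this counter only the loop exit is mispredicted on each execution, because one not-taken outcome moves the state from 3 to 2, which still predicts taken.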
WE CAN DESCRIBE THE
ALGORITHM WITH A FSM
BRANCH HISTORY
TABLE ACCURACY
A misprediction occurs when:
• The guess for that branch is wrong
• The entry holds the history of a different branch, because two
different branches can map to the same index
BRANCH HISTORY
TABLE ACCURACY
For a 4096-entry table, measured misprediction rates
range from 1% to 18% across programs:
• Nasa7, tomcatv 1%
• Eqntott 18%
• Spice 9%
• Gcc 12%
A 4096-entry table is about as good as an infinite table (for the
Alpha 21164)
CORRELATING BRANCHES
Basic hypothesis: recent branches are
correlated, i.e., behavior of recently
executed branches affects the prediction of
current branch:
CORRELATING
BRANCHES EXAMPLE
C code:
    if (a == 2) bb1;
L1: if (b == 2) bb2;
L2: if (a != b) bb3;
L3:
DLX code (a in r1, b in r2):
    subi r3,r1,2
    bnez r3,L1
    add r1,r0,r0 ; bb1
L1: subi r3,r2,2
    bnez r3,L2
    add r2,r0,r0 ; bb2
L2: sub r3,r1,r2
    beqz r3,L3
    ... ; bb3
L3:
The branch at L2 is correlated with the two previous branches.
If both are not taken, then the branch at L2 is taken.
IDEA:
record m most recently executed branches as taken or not
taken. Use that pattern to select the proper branch history
table.
EXAMPLE OF A SIMPLE
CORRELATING
PREDICTOR
The branch is predicted on the basis of the
previously executed one by selecting the
appropriate 1-bit BHT.
[Figure: two branch prediction tables, one used if the last branch was taken and one used if the last branch was not taken. The last branch result selects which table predicts the branch; the effective branch result updates it.]
(M,N) PREDICTORS
In general, an (m,n) predictor records the outcomes of
the last m branches and uses them to select
among 2^m history tables with n-bit entries.
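The selection among 2^m tables can be sketched in a few lines. This is an illustrative model, not a description of any particular processor: the class name `CorrelatingPredictor`, the `index_bits` parameter and the use of saturating counters for the n-bit entries are all assumptions.

```python
class CorrelatingPredictor:
    """Sketch of an (m, n) predictor with a global m-bit history."""

    def __init__(self, m, n, index_bits):
        self.history_mask = (1 << m) - 1
        self.max_count = (1 << n) - 1
        self.index_mask = (1 << index_bits) - 1
        self.history = 0  # outcomes of the last m branches, as bits
        # 2**m tables, each holding 2**index_bits n-bit counters.
        self.tables = [[0] * (1 << index_bits) for _ in range(1 << m)]

    def predict(self, pc):
        """Predict taken when the selected counter is in its upper half."""
        counter = self.tables[self.history][pc & self.index_mask]
        return counter > self.max_count // 2

    def update(self, pc, taken):
        """Train the selected counter, then shift the outcome into history."""
        table = self.tables[self.history]
        i = pc & self.index_mask
        if taken:
            table[i] = min(table[i] + 1, self.max_count)
        else:
            table[i] = max(table[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = CorrelatingPredictor(m=2, n=2, index_bits=4)
for outcome in [True, False] * 3:   # warm up on an alternating branch
    predictor.update(0, outcome)
print(predictor.predict(0))
```

With m = 0 the model degenerates to a single table, i.e. the plain n-bit BHT.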
EXAMPLE OF A (2,2)
CORRELATING BRANCH
PREDICTOR
Each cell of the predictor represents the state of a
2-bit branch predictor.
ACCURACY OF DIFFERENT
SCHEMES
[Figure: bar chart of the frequency of mispredictions (0% to 18%) for the benchmarks doducd, gcc, nasa7, eqntott, espresso, spice, fpppp, tomcatv, li and matrix300, comparing three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2).]
ADDRESS MUST ALSO
BE PREDICTED
Access the Branch Target Buffer in the IF stage.
Typical entry:
Exact address of a branch | Predicted PC (only if not sequential)
BRANCH TARGET
BUFFER STRUCTURE
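[Figure: the PC of the fetched instruction is compared, by associative lookup, with the branch addresses stored in the buffer; each entry also holds a predicted PC.]
• No match: the instruction is not predicted to be a branch; proceed normally.
• Match: the instruction is a branch; the predicted PC should be used as the next PC.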
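The lookup can be sketched with a dictionary standing in for the associative memory. This is a simplified model under stated assumptions: the class name `BranchTargetBuffer`, a fixed 4-byte instruction size, and the policy of dropping an entry when its branch is not taken are illustrative choices, not taken from the slides.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the address of a branch to its predicted target."""

    def __init__(self):
        self.entries = {}  # branch PC -> predicted next PC

    def next_pc(self, pc, instr_size=4):
        """Choose the next fetch address during IF."""
        # Hit: the instruction is predicted to be a taken branch.
        # Miss: treat it as a non-branch and fetch sequentially.
        return self.entries.get(pc, pc + instr_size)

    def update(self, pc, taken, target):
        """Record the real outcome once the branch has been resolved."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)  # only non-sequential targets are kept


btb = BranchTargetBuffer()
first_fetch = btb.next_pc(0x100)       # miss: falls through to 0x104
btb.update(0x100, taken=True, target=0x40)
second_fetch = btb.next_pc(0x100)      # hit: predicted target 0x40
```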
BRANCH TARGET BUFFER
SCOREBOARD SCHEME
IMPLEMENTS THE ILP
LECTURE 3 CONTD..
PROF. DR. M. A. ROUF

DEPT. OF CSE, DUET, GAZIPUR
KEY IDEA OF ILP:
• Allow instructions behind a stalled instruction to proceed
• Enables out-of-order execution and completion
(commit).
• First implemented in the CDC 6600 (1963), the flagship of the
6000 series of mainframe computer systems manufactured by
Control Data Corporation
[Figure: CDC 6600 (1963)]
EXAMPLE:
1. DIVD F0,F2,F4
2. ADDD F10,F0,F8
3. SUBD F12,F8,F14
• ADDD must stall on F0 (waiting for
DIVD to write its result).
• DIVD is a floating-point operation and
requires several cycles to complete.
• SUBD depends on neither instruction, but it would
stall behind ADDD without dynamic scheduling.
DEPENDENCE
1. Data Dependence: True dependence
2. Name dependence: WAR or WAW
3. Control Dependence
• Data dependence: instruction j is data dependent on instruction i if either
• instruction i produces a result that may be used by
instruction j, or
• instruction j is data dependent on instruction k and k is
data dependent on instruction i (transitive dependence)
DEPENDENCE
Data Dependence
Name Dependence
NAME DEPENDENCE
• Name dependence:
• A name dependence occurs when two or more
instructions use the same register or memory
location (called a name), but there is no flow of
data between the instructions associated with that
name.
• Two types of name dependence
• Anti dependence: Write After Read (WAR)
• Output dependence: Write After Write (WAW)
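The three kinds of register dependence between an earlier and a later instruction can be checked mechanically. The sketch below is illustrative only: instructions are modelled as hypothetical `(dest, src1, src2)` tuples and `classify` is an invented helper name.

```python
def classify(first, second):
    """Return the dependences of `second` on the earlier `first`.

    Each instruction is a (dest, src1, src2) tuple of register names.
    """
    dest1, *sources1 = first
    dest2, *sources2 = second
    kinds = []
    if dest1 in sources2:
        kinds.append("RAW")  # true data dependence
    if dest2 in sources1:
        kinds.append("WAR")  # antidependence (name dependence)
    if dest1 == dest2:
        kinds.append("WAW")  # output dependence (name dependence)
    return kinds


divd = ("F0", "F2", "F4")     # DIVD F0,F2,F4
addd = ("F10", "F0", "F8")    # ADDD F10,F0,F8
subd = ("F8", "F8", "F14")    # SUBD F8,F8,F14

print(classify(divd, addd))
print(classify(addd, subd))
```

Only the RAW case carries data from one instruction to the other; the WAR and WAW cases arise purely from reusing a register name.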
SCOREBOARD SCHEME 4 STAGES
1. The ID stage is split into two parts:
a) Issue stage (decode and check for
structural hazards).
b) Read Operands stage (wait until there are no
data hazards).
2. EX stage: the scoreboard allows instructions
without dependencies to execute out of
order.
3. Write result stage
• Out-of-order completion can cause WAR and
WAW hazards. WAR hazards are resolved by:
1. Queueing both the operation and copies of its
operands.
2. Reading registers only during the Read Operands
stage.
1. For WAW, the machine stalls write
result until the previous instruction
completes
2. Multiple execution units
3. Scoreboard keeps track of
dependencies and state of
operations.
FOUR STAGES OF
SCOREBOARD CONTROL
1. Issue
• Decode instructions and check for structural hazards.
• If a functional unit for the instruction is free and no
other active instruction has the same destination
register (WAW), the scoreboard issues the instruction
to the functional unit and updates its internal data
structure.
• If a structural or a WAW hazard exists, then the
instruction issue stalls, and no further instructions will
issue until these hazards are cleared.
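The issue test described above reduces to two lookups. The following is a minimal sketch under simplifying assumptions: functional units are identified by name, `busy` and `register_result` are plain dictionaries, and `try_issue` is a hypothetical helper rather than part of any real simulator.

```python
def try_issue(unit, dest, busy, register_result):
    """Issue one instruction if no structural or WAW hazard exists.

    busy: dict mapping a functional unit name to True while it is occupied.
    register_result: dict mapping a register to the unit that will write it.
    Returns True when the instruction issues, False when issue stalls.
    """
    structural_hazard = busy.get(unit, False)
    waw_hazard = dest in register_result
    if structural_hazard or waw_hazard:
        return False
    busy[unit] = True              # update the scoreboard's bookkeeping
    register_result[dest] = unit
    return True


busy = {}
register_result = {}
first = try_issue("Integer", "F6", busy, register_result)   # LD F6,34(R2)
second = try_issue("Integer", "F2", busy, register_result)  # LD F2,45(R3)
print(first, second)
```

The second load stalls because the Integer unit is still occupied by the first one, which is the structural hazard that holds up issue in the worked example.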
FOUR STAGES OF
SCOREBOARD CONTROL
2. Read Operands
a) Wait until no data hazards, then read operands
b) A source operand is available if:
- no earlier issued active instruction will write it or
- A functional unit is writing its value in a register
c) When the source operands are available, the
scoreboard tells the functional unit to proceed to
read the operands from the registers and begin
execution.
d) RAW hazards are resolved dynamically in this step,
and instructions may be sent into execution out of
order.
FOUR STAGES OF
SCOREBOARD CONTROL
3. Execution Stage
• Operate on operands
• The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the
scoreboard that it has completed execution.
• FUs are characterized by:
- Latency (the effective time used to complete one
operation).
- Initiation interval (the number of cycles that must
elapse between issuing two operations to the same
functional unit).
FOUR STAGES OF
SCOREBOARD CONTROL
4. Write result
• Finish execution
• Once the scoreboard is aware that the
unit has completed execution, the
scoreboard checks for WAR hazards.
• If none, it writes results.
• If WAR, then write result is stalled.
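The WAR check at write-result time can be expressed over the functional unit status fields. This sketch assumes that Rj/Rk are true while an operand is ready but has not yet been read, and models each unit as a small dictionary; `can_write_result` is an illustrative name.

```python
def can_write_result(writer, units):
    """Return True if `writer` may write its destination register.

    Each unit is a dict with keys fi, fj, fk, rj, rk. The write must wait
    while any other unit still has to read the old value of writer["fi"].
    """
    for unit in units:
        if unit is writer:
            continue
        pending_j = unit["fj"] == writer["fi"] and unit["rj"]
        pending_k = unit["fk"] == writer["fi"] and unit["rk"]
        if pending_j or pending_k:
            return False  # WAR hazard: stall the write
    return True


# ADDD F6,F8,F2 has finished executing; DIVD F10,F0,F6 has not read F6 yet.
add = {"fi": "F6", "fj": "F8", "fk": "F2", "rj": False, "rk": False}
divide = {"fi": "F10", "fj": "F0", "fk": "F6", "rj": False, "rk": True}

stalled = not can_write_result(add, [add, divide])
divide["rk"] = False  # DIVD has now read its operands
released = can_write_result(add, [add, divide])
print(stalled, released)
```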
WAR EXAMPLE
Original code          With register renaming
DIVD F0,F2,F4          DIVD F0,F2,F4
ADDD F10,F0,F8         ADDD F10,F0,F9
SUBD F8,F8,F14         SUBD F8,F9,F14
In the original code, the scoreboard stalls SUBD
in the WB stage until ADDD has read F0 and F8.
SCOREBOARD STRUCTURE
1. Instruction status
2. Functional Unit status
Indicates the state of the functional unit (FU):
a) Busy – Indicates whether the unit is busy or not
b) Op - The operation to perform in the unit (+,-, etc.)
c) Fi - Destination register
d) Fj, Fk – Two Source Register Numbers
e) Qj, Qk – Functional units producing source registers
f) Rj, Rk – Flags indicating when Fj, Fk are ready
3. Register result status.
• Indicates which functional unit will write each register.
Blank if no pending instructions will write that register.
ASSUMPTIONS FOR
EXAMPLE
1. Load/store unit :
a) Address calculation unit :1
b) Latency :1 cycle
2. ALU/Integer unit : Execution latency 2 cycles
3. Floating point MultD: Execution latency 10
cycles
4. Floating point DivD: Execution latency 40
cycles
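Under these assumptions, the cycle in which an instruction completes execution is the cycle in which it reads its operands plus the latency of its unit. The sketch below checks this against the read-operands cycles used in the worked example in these slides; the `LATENCY` table and `execution_complete` helper are illustrative names, and mapping ADDD and SUBD to the 2-cycle latency is an assumption consistent with that example.

```python
# Execution latencies in cycles, taken from the assumptions above.
LATENCY = {"LD": 1, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}


def execution_complete(op, read_operands_cycle):
    """Cycle in which execution completes for an instruction."""
    return read_operands_cycle + LATENCY[op]


# (operation, cycle in which it reads its operands) from the worked example.
timeline = [("LD", 2), ("LD", 6), ("MULTD", 9),
            ("SUBD", 9), ("DIVD", 21), ("ADDD", 14)]

completions = [execution_complete(op, cycle) for op, cycle in timeline]
print(completions)
```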
SCOREBOARD EXAMPLE
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status
Time  Name  Busy  Op  Fi  Fj  Fk  Qj  Qk  Rj  Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide No
SCOREBOARD
EXAMPLE CYCLE 1
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult1 No
Mult2 No
Add No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Integer
SCOREBOARD
EXAMPLE CYCLE 2
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult1 No
Mult2 No
Add No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Integer
• Integer unit is busy: the 2nd load cannot issue, so issue stalls
SCOREBOARD
EXAMPLE CYCLE 3
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult1 No
Mult2 No
Add No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer
• Issue stalls
SCOREBOARD
EXAMPLE CYCLE 4
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult1 No
Mult2 No
Add No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Integer
• Issue stalls because there is a single LD/ST unit and a single address
calculation unit
SCOREBOARD
EXAMPLE CYCLE 5
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult1 No
Mult2 No
Add No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer
• In this cycle the 2nd load is issued.
SCOREBOARD
EXAMPLE CYCLE 6
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6
MULTD F0 F2 F4 6
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult2 No
Add No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 Integer
• MULTD is issued but has to wait for F2
SCOREBOARD
EXAMPLE CYCLE 7
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6
ADDD F6 F8 F2
Mult2 No
Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 Integer Add
• SUBD is issued, but must wait for operand F2 before it can read its operands.
SCOREBOARD
EXAMPLE CYCLE 8A

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Integer Add Divide
• DIVD is issued but there is another RAW hazard for F0
SCOREBOARD
EXAMPLE CYCLE 8B

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Integer No
Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Sub F8 F6 F2 Yes Yes
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• Load completes, and operands for MULTD and SUBD are ready
SCOREBOARD
EXAMPLE CYCLE 9

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• MULTD and SUBD read their operands and are sent for
execution in parallel
SCOREBOARD
EXAMPLE CYCLE 11
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• The SUBD finishes execution
SCOREBOARD
EXAMPLE CYCLE 12

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Integer No
Mult2 No
Add No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 Divide
• Read operands for DIVD: it cannot read F0 before MULTD
writes F0
SCOREBOARD
EXAMPLE CYCLE 13

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• SUBD has written its result, so ADDD can be issued
SCOREBOARD
EXAMPLE CYCLE 14
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• ADDD can read operands
SCOREBOARD
EXAMPLE CYCLE 15
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• ADDD executes on its operands
SCOREBOARD
EXAMPLE CYCLE 16
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• ADDD finishes execution
SCOREBOARD
EXAMPLE CYCLE 17

Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• Write result of ADDD? No: there is a WAR hazard on F6
SCOREBOARD
EXAMPLE CYCLE 18
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• Stall continued, nothing to do
SCOREBOARD
EXAMPLE CYCLE 19
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Integer No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
• MULTD, execution completed
SCOREBOARD
EXAMPLE CYCLE 20
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Integer No
Mult1 No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Add Divide
• MULTD writes back its result
SCOREBOARD
EXAMPLE CYCLE 21
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16
Integer No
Mult1 No
Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide
• DIVD reads its operands, now that F0 is available
SCOREBOARD
EXAMPLE CYCLE 22
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21
ADDD F6 F8 F2 13 14 16 22
Integer No
Mult1 No
Mult2 No
Add No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU Divide
• Now DIVD can execute on its operands, and ADDD can write
the result
SCOREBOARD
EXAMPLE CYCLE 61
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61
ADDD F6 F8 F2 13 14 16 22
Integer No
Mult1 No
Mult2 No
Add No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide
• DIVD finishes execution after 40 cycles
SCOREBOARD
EXAMPLE CYCLE 62
Instruction  j  k  Issue  Read operands  Execution complete  Write result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61 62
ADDD F6 F8 F2 13 14 16 22
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU
• DIVD finishes by writing back its result
Lecture 3.3
Tomasulo’s Algorithm
Dynamic Scheduling Using Tomasulo’s
Algorithm
• It was used in the IBM 360/91 floating-point unit
• Invented by Robert Tomasulo of IBM
• It tracks when operands become available
• It minimizes RAW hazards
• It introduces register renaming to minimize WAW
and WAR hazards

Hardware Speculation (“Boosting”)
• Issue an instruction that depends on a branch before the branch result is
known.
• Commit is always made in order.
• A speculative instruction is committed only when the branch
outcome is known.
• The same holds for exceptions (synchronous or asynchronous
deviations of control flow)

Speculative Tomasulo’s Algorithm
• Tomasulo’s “Boosting” needs a buffer for uncommitted
results, which is called the reorder buffer (ROB).
• Each entry holds:
Instruction | Destination | Value
• The ROB has a slot for each issued instruction.
• When an instruction writes into a register, it writes only into its
assigned slot in the ROB.
• The reorder buffer can be an operand source
• The reservation station (RS) or load buffers
• Destination like register file (RF) and store buffers

Tomasulo’s ROB (cont.)
• Reservation stations (RS) now only queue
instructions for the FUs (to reduce structural
hazards)
• Pointers are now directed toward ROB
slots.
• A common data bus (CDB) forwards
results to the instructions
waiting for them as operands
Four steps of speculative Tomasulo’s Algorithm
1. Issue: get an instruction from the queue. Both the RS and
the ROB must have a free slot. Dispatch the
operation, indicating the ROB slot into which it must write.
2. Execution: when both operands are ready, the instruction is
executed. If not, watch the CDB for them.
3. Write result: write on the CDB and into the ROB.
4. Commit (“graduation”): the instruction at the head of
the ROB updates its destination register and is
removed. A mispredicted branch flushes the ROB.
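The in-order commit in step 4 can be sketched with a queue. This is a deliberately reduced model: it ignores reservation stations and the CDB, and the class `ReorderBuffer` with its method names is an assumption made for illustration.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: results wait here until they can commit in program order."""

    def __init__(self):
        self.slots = deque()   # oldest instruction at the left (head)
        self.registers = {}    # architectural register file

    def issue(self, name, dest):
        """Allocate a slot at the tail for a newly issued instruction."""
        entry = {"instr": name, "dest": dest, "value": None, "done": False}
        self.slots.append(entry)
        return entry

    def write_result(self, entry, value):
        """A finished instruction writes only into its own ROB slot."""
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire finished instructions from the head, in program order."""
        committed = []
        while self.slots and self.slots[0]["done"]:
            entry = self.slots.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["instr"])
        return committed

    def flush(self):
        """Discard all uncommitted work, e.g. after a mispredicted branch."""
        self.slots.clear()


rob = ReorderBuffer()
divd = rob.issue("DIVD", "F0")
addd = rob.issue("ADDD", "F10")
rob.write_result(addd, 1.0)    # the younger instruction finishes first
early = rob.commit()           # nothing commits: DIVD is still at the head
rob.write_result(divd, 2.0)
late = rob.commit()            # both commit, in program order
```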
Speculative Tomasulo’s algorithm

Tomasulo Algorithm
• Invented at IBM for the IBM 360/91, 3 years after the CDC 6600
• Same goal: performance without special
compilers
• Led to: Alpha 21264, HP 8000, MIPS 10000, Pentium II,
PowerPC 604

Tomasulo Algorithm Basics
• The control logic and the buffers are
distributed with the FUs
• Operand buffers are called reservation
stations.
• Each instruction is an entry of a reservation
station.
• Its operands are replaced by values or
pointers (register renaming)

Tomasulo Algorithm Basics
• Register renaming avoids WAR and WAW hazards
• There are more reservation stations than
registers (so the hardware can do better optimizations
than a compiler).
• Results are dispatched to other FUs through
a Common Data Bus
• Loads and stores are treated as FUs

Tomasulo Algorithm for an FPU

Reservation Station Components
• Tag: identifies the RS
• Op: the operation to perform on the
operands
• Vj, Vk: values of the source operands
• Qj, Qk: pointers to the RSs that will produce Vj, Vk
• Busy: indicates that the RS is busy

Other components
RF and the Store buffer have a Value
(V) and a Pointer (Q) field.
Load buffers have an address field, and
a busy field.
Store Buffers have also an address
field.

The three stages of the Tomasulo Algorithm
• ISSUE
• Get an instruction I from the queue. If it is an FP operation,
check whether an RS is free (i.e., check for structural
hazards).
• Rename registers:
• WAR resolution: if I writes Rx and an already issued
instruction J reads Rx, J already holds the value of Rx or
knows which instruction will write it, so the RF entry
for Rx can be linked to I.
• WAW resolution: since issue is in order, the
RF entry can be linked to I.

The Three Stages of the Tomasulo
Algorithm (cont.)
• Execution:
When both operands are ready, execute.
If they are not ready, watch the common
data bus for results
• Write result:
Write on the Common Data Bus (CDB) to all
waiting units; mark the reservation station
available.

The Common Data Bus
•A common data bus is a data+source
bus.
•In the IBM 360/91
Data=64 bits, Source=4 bits
•FU must perform associative lookup in
the RS.

Tomasulo (IBM) versus Scoreboard (CDC)
Tomasulo (IBM)                      | Scoreboard (CDC)
1. Pipelined FUs                    | 1. Multiple but not pipelined FUs
2. Issue window size = 14           | 2. Issue window size = 5
3. No issue on structural hazards   | 3. No issue on structural hazards
4. WAR, WAW avoided with renaming   | 4. Stall the completion for WAW and WAR hazards
5. Broadcast results from FU        | 5. Results written back on registers
6. Control distributed on RS        | 6. Control centralized through the Scoreboard

Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the FP op queue
a) Stall if there is a structural hazard, i.e. no free reservation station (RS).
b) If an RS is free, the issue logic
issues the instruction to the RS and reads the operands into the RS if
they are ready (register renaming => solves WAR).
c) Mark the destination register as waiting for
this latest instruction, even if a previous instruction
writing to this register has not completed =>
solves WAW hazards.
2. Execution: operate on operands (EX)
When both operands are ready, execute;
if not ready, watch the CDB for the result => solves RAW
3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting units;
mark the reservation station available. Write the result
into the destination register only if its status still names this RS => solves WAW.

•Normal data bus: data + destination

(“go to” bus)
•CDB: data + source (“come from” bus)
•64 bits of data + 4 bits of Functional Unit
source address
•Write if matches expected Functional Unit
(produces result)
•Does broadcast

Reservation Station Components
Op: operation to perform in the unit (e.g., + or –)
Vj, Vk: values of the source operands.
Qj, Qk: names of the RSs that will provide the source operands.
A value of zero means the source operand is already available in Vj or
Vk, or is not necessary.
Busy: indicates that the reservation station or FU is busy
Register File Status Qi:
Qi: indicates which functional unit will write each register, if one
exists. Blank (0) when no pending instruction will write that
register, meaning that the value is already available.
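The fields above, together with the issue and write-result stages, can be sketched as follows. This is a simplified illustration rather than the IBM 360/91 design: an empty string stands in for the "zero" tag, the result value is passed in instead of being computed, and the names `ReservationStation`, `issue` and `broadcast` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: str = ""
    vj: object = None   # value of the first source operand, once known
    vk: object = None   # value of the second source operand, once known
    qj: str = ""        # RS that will produce Vj ("" means Vj is valid)
    qk: str = ""        # RS that will produce Vk ("" means Vk is valid)


def issue(rs, op, dest, src_j, src_k, regs, reg_status):
    """Fill a free station, renaming sources to values or producer tags."""
    if rs.busy:
        return False  # structural hazard: stall
    rs.busy, rs.op = True, op
    rs.qj = reg_status.get(src_j, "")
    rs.vj = None if rs.qj else regs[src_j]
    rs.qk = reg_status.get(src_k, "")
    rs.vk = None if rs.qk else regs[src_k]
    reg_status[dest] = rs.name  # the latest writer owns the register (WAW)
    return True


def broadcast(tag, value, stations, regs, reg_status):
    """CDB: deliver (source tag, value) to every consumer waiting on tag."""
    for rs in stations:
        if rs.busy and rs.qj == tag:
            rs.vj, rs.qj = value, ""
        if rs.busy and rs.qk == tag:
            rs.vk, rs.qk = value, ""
        if rs.name == tag:
            rs.busy = False  # the producing station becomes available
    for reg, producer in list(reg_status.items()):
        if producer == tag:  # write only if this RS is still the last writer
            regs[reg] = value
            del reg_status[reg]


regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F10": 0.0}
reg_status = {}
mult1, mult2 = ReservationStation("Mult1"), ReservationStation("Mult2")
issue(mult1, "MULTD", "F0", "F2", "F4", regs, reg_status)
issue(mult2, "DIVD", "F10", "F0", "F6", regs, reg_status)
waiting_on = mult2.qj  # DIVD waits for the station producing F0
broadcast("Mult1", 8.0, [mult1, mult2], regs, reg_status)
```

Because consumers match on the producer's tag rather than on a register name, a later write to the same register cannot be confused with the value they are waiting for.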

Tomasulo Example Cycle 0
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 Load1 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU
Tomasulo Example Cycle 1
LD F6 34+ R2 1 Load1 Yes 34+R2
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Load1
Tomasulo Example Cycle 2
LD F6 34+ R2 1 2- Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
SUBD F8 F6 F2
(Assume a load takes 2 cycles)
DIVD F10 F0 F6
ADDD F6 F8 F2
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Load2 Load1
Tomasulo Example Cycle 3
LD F6 34+ R2 1 2--3 Load1 Yes 34+R2
LD F2 45+ R3 2 3- Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
0 Add1 No
0 Add2 No read value
Add3 No
0 Mult1 Yes Mult R(F4) Load2
0 Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Mult1 Load2 Load1
Tomasulo Example Cycle 4
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 Load2 Yes 45+R3
SUBD F8 F6 F2 4
DIVD F10 F0 F6
ADDD F6 F8 F2
0 Add1 Yes Sub M(A1) Load2
0 Add2 No
Add3 No
0 Mult1 Yes Mult R(F4) Load2
0 Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Mult1 Load2 M(A1) Add1
Tomasulo Example Cycle 5
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2
2 Add1 Yes Sub M(A1) M(A2)
0 Add2 No
Add3 No
10 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Mult1 M(A2) M(A1) Add1 Mult2
Tomasulo Example Cycle 6
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 --
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
0 Add2 Yes Add M(A2) Add1
Add3 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 M(A2) Add2 Add1 Mult2
Tomasulo Example Cycle 7
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
SUBD F8 F6 F2 4 6 -- 7
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
0 Add2 Yes Add M(A2) Add1
Add3 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 M(A2) Add2 Add1 Mult2
Tomasulo Example Cycle 8
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
0 Add1 No
2 Add2 Yes Add M1-M2 M(A2)
Add3 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 M(A2) Add2 M1-M2 Mult2
Tomasulo Example Cycle 9
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 --
0 Add1 No
Add3 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
Tomasulo Example Cycle 10
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10
0 Add1 No
Add3 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
Tomasulo Example Cycle 11
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
0 Add1 No
Add2 No
Add3 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 M(A2) M1-M2+M(A2)
M1-M2 Mult2
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
0 Add1 No
Add2 No
Add3 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
M1-M2 Mult2
Tomasulo Example Cycle 15
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
0 Add1 No
Add2 No
Add3 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
M1-M2 Mult2
Tomasulo Example Cycle 16
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
0 Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes Div M*F4 M(A1)
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU M*F4 M(A2) M1-M2+M(A2)
M1-M2 Mult2
Tomasulo Example Cycle 56
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5 17 -- 56
ADDD F6 F8 F2 6 9 -- 10 11
0 Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes Div M*F4 M(A1)
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
56 FU M*F4 M(A2) M1-M2+M(A2)
M1-M2 Mult2
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5 17 -- 56 57
ADDD F6 F8 F2 6 9 -- 10 11
0 Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 No
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
57 FU M*F4 M(A2) M1-M2+M(A2)
M1-M2 result
CSE-4821
Advanced Computer Architecture
L3.4: VLIW Architecture
Dr. M. A. Rouf
Dept. of CSE, DUET, Gazipur
Basic Working Principles of VLIW
• Aim at speeding up computation by exploiting
instruction-level parallelism.
• Same hardware core as superscalar processors, having
multiple execution units (EUs) working in parallel.
• An instruction consists of multiple operations;
typical word length from 52 bits to 1 Kbits.
• All operations in an instruction are executed in
lock-step mode.
• One or multiple register files for FX and FP data.
• Relies on the compiler to find parallelism and schedule
dependency-free program code.
CSE-4821, L3.4 VLIW Architecture for ILP, Dr. M. A. Rouf, Dept. of CSE, DUET
Comparison of VLIW, CISC, RISC
Generation of VLIW instruction words
A hypothetical VLIW processor architecture
Basic VLIW Approach

Register File Structure for VLIW
• What is the challenge for the register file in VLIW?

R/W ports
Differences Between VLIW &
Superscalar Architecture (I)

Differences Between VLIW &
Superscalar Architecture (II)
• Instruction formulation:
– Superscalar:
• Receive conventional instructions conceived for sequential processors.
– VLIW:
• Receive (very) long instruction words, each comprising a field (or
opcode) for each execution unit.
• Instruction word length depends on (a) the number of execution units, and (b)
the code length needed to control each unit (such as opcode length, register
names, …).
• Typical word length is 64 – 1024 bits, much longer than conventional
machine word length.

Differences Between VLIW &
Superscalar Architecture (III)
• Instruction scheduling:
– Superscalar:
• Done dynamically at run-time by the hardware.
• Data dependency is checked and resolved in hardware.
• Need a lookahead hardware window for instruction fetch.

Differences Between VLIW &
Superscalar Architecture (IV)
• Instruction scheduling (cont’d):
– VLIW:
• Static scheduling done at compile-time by the compiler.
• Advantages:
– Reduce hardware complexity.
– Tasks such as decoding, data dependency detection,
instruction issue, etc. become simpler.
– Potentially higher clock rate.
– Higher degree of parallelism with global program
information.

Differences Between VLIW &
Superscalar Architecture (V)
• Instruction scheduling (cont’d):
– VLIW:
• Disadvantages
– Higher complexity of the compiler.
– Compiler optimization needs to consider technology
dependent parameters such as latencies and load-use
time of cache.
(Question: What happens to the software if the hardware
is updated?)
– Non-deterministic problem of cache misses, resulting in
worst case assumption for code scheduling.
– In case of un-filled opcodes in a (V)LIW, memory space
and instruction bandwidth are wasted.
Development history of
Proposed/Commercial VLIWs

Case Study of VLIW: Trace 200 Family (I)

Case Study of VLIW: Trace 200 Family (II)
• Only two branches might be used in Trace 7/200
Code Expansion in VLIW
• It is found that code in VLIW is expanded roughly by
a factor of three.
• For “long” VLIW, more opcode fields will be left empty.
This wastes instruction bandwidth and storage
space. Can you propose a solution for it?
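The empty-field problem can be made concrete with a small sketch. The packer below is a hypothetical illustration, not a real VLIW compiler: it assumes every slot can hold any operation (unit types are ignored), takes the dependences as given, and places each operation in the earliest word that follows all of its predecessors. The six-operation fragment and the 4-slot word are made-up inputs.

```python
def pack_into_words(num_ops, deps, slots):
    """Greedily place operations 0..num_ops-1 (program order) into long words.

    deps maps an operation to the set of earlier operations it depends on.
    An operation goes into the earliest word that comes after all of its
    predecessors and still has a free slot. Unit types are ignored here.
    """
    words = []      # words[w] is the list of operations issued together
    word_of = {}    # operation -> index of the word that holds it
    for op in range(num_ops):
        preds = deps.get(op, set())
        # Earliest legal word: one past the latest predecessor's word.
        w = max((word_of[p] + 1 for p in preds), default=0)
        # Skip words that are already full.
        while w < len(words) and len(words[w]) >= slots:
            w += 1
        # Open new (empty) words if needed.
        while len(words) <= w:
            words.append([])
        words[w].append(op)
        word_of[op] = w
    return words


# Hypothetical fragment: 0 -> 1 -> 2 and 3 -> 4 are dependence chains,
# operation 5 is independent.
example_deps = {1: {0}, 2: {1}, 4: {3}}
example_words = pack_into_words(6, example_deps, slots=4)
total_slots = len(example_words) * 4
empty_slots = total_slots - 6
print(example_words, empty_slots, total_slots)
```

In this made-up case half of the fields carry no operation, which is the storage and bandwidth waste the slide refers to; wider words with the same dependence chains would leave even more fields empty.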

END

Compiler techniques
for exposing ILP
CSE-4821 Advanced Computer Architecture

Dr. M. A. Rouf, Dept. of CSE, DUET, Gazipur
Instruction Level Parallelism
• Potential overlap among instructions
• Few possibilities in a basic block
– Blocks are small (6-7 instructions)
– Instructions are dependent
• Goal: Exploit ILP across multiple basic
blocks
– Iterations of a loop
for (i = 1000; i > 0; i=i-1)
    x[i] = x[i] + s;
CSE-4821 Compiler Techniques for ILP, by Dr. M. A. Rouf
Basic Scheduling
for (i = 1000; i > 0; i=i-1)
    x[i] = x[i] + s;

Sequential MIPS Assembly Code
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop

Pipelined execution:                  Cycle
Loop: LD   F0, 0(R1)                  1
      stall                           2
      ADDD F4, F0, F2                 3
      stall                           4
      stall                           5
      SD   0(R1), F4                  6
      SUBI R1, R1, #8                 7
      stall                           8
      BNEZ R1, Loop                   9
      stall                           10

Scheduled pipelined execution:        Cycle
Loop: LD   F0, 0(R1)                  1
      SUBI R1, R1, #8                 2
      ADDD F4, F0, F2                 3
      stall                           4
      BNEZ R1, Loop                   5
      SD   8(R1), F4                  6
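The cycle counts of the two schedules can be reproduced with a small stall model. This is a sketch under assumptions: the stall table below (1 stall from LD to a dependent ADDD, 2 from ADDD to a dependent SD, 1 from SUBI to a dependent BNEZ, and 1 unfilled branch delay slot) is inferred from the cycle numbers on the slide, and the tuple encoding of instructions is hypothetical.

```python
# Assumed stall cycles between a producer and a dependent consumer.
STALLS = {("LD", "ADDD"): 1, ("ADDD", "SD"): 2, ("SUBI", "BNEZ"): 1}
BRANCHES = {"BNEZ", "BEQZ"}


def cycles(program):
    """Return the cycles needed for one pass over `program`.

    Each instruction is (opcode, destination register or None, source registers).
    """
    last_writer = {}   # register -> (opcode, issue cycle) of its latest writer
    cycle = 0
    for op, dest, srcs in program:
        earliest = cycle + 1                    # in-order, one issue per cycle
        for reg in srcs:
            if reg in last_writer:
                p_op, p_cycle = last_writer[reg]
                earliest = max(earliest, p_cycle + 1 + STALLS.get((p_op, op), 0))
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (op, cycle)
    if program[-1][0] in BRANCHES:
        cycle += 1                              # unfilled branch delay slot
    return cycle


LD = ("LD", "F0", ("R1",))
ADDD = ("ADDD", "F4", ("F0", "F2"))
SD = ("SD", None, ("F4", "R1"))
SUBI = ("SUBI", "R1", ("R1",))
BNEZ = ("BNEZ", None, ("R1",))

unscheduled = [LD, ADDD, SD, SUBI, BNEZ]
scheduled = [LD, SUBI, ADDD, BNEZ, SD]      # SD fills the branch delay slot

print(cycles(unscheduled), cycles(scheduled))
```

The model issues BNEZ one cycle earlier than the slide draws it and puts the remaining stall in front of SD, but the total per iteration is the same in both cases.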
Loop Unrolling
Unrolling 4 Times
for (i = 1000; i > 0; i=i-4)
{
x[i] = x[i] + s;
x[i-1] = x[i-1] + s;
x[i-2] = x[i-2] + s;
x[i-3] = x[i-3] + s;
}

Loop Unrolling
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, #8
      BNEZ R1, Loop
Exit:

Pros:
Larger basic block
More scope for scheduling and eliminating dependencies
Cons:
Increases code size
Comment:
Often a possibility for other optimizations
Loop Transformations
• Instruction independence is the key
requirement for the transformations
• Example
– Determine that it is legal to move SD after SUBI and
BNEZ
– Determine that unrolling is useful (iterations are
independent)
– Use different registers to avoid unnecessary constraints
– Eliminate extra tests and branches
– Determine that LD and SD can be interchanged
– Schedule the code, preserving the semantics of the
code
1. Eliminating Name Dependences
Before renaming:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, #32
      BNEZ R1, Loop

After register renaming (rename the F0 register in each LD to remove the dependency):
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, #32
      BNEZ R1, Loop
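The renaming step can be sketched in a few lines. This is an illustration under assumptions, not the compiler's actual algorithm: instructions are encoded as hypothetical (opcode, destination, sources) tuples with memory offsets omitted, every write to a floating-point register takes the next unused even-numbered name, and F2 (the scalar s) is treated as live-in and never reassigned.

```python
def rename(program, live_in=("F2",)):
    """Give every FP destination a fresh register to remove name dependences."""

    def fresh_names():
        # Even-numbered FP registers F0, F4, F6, ... skipping live-in names.
        n = 0
        while True:
            name = f"F{n}"
            if name not in live_in:
                yield name
            n += 2

    pool = fresh_names()
    current = {}   # original register name -> name holding its latest value
    renamed = []
    for op, dest, srcs in program:
        # Sources read whichever register currently holds the value.
        new_srcs = tuple(current.get(r, r) for r in srcs)
        new_dest = dest
        if dest is not None and dest.startswith("F"):
            new_dest = next(pool)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# One copy of the loop body; the unrolled loop repeats it four times.
body = [
    ("LD", "F0", ("R1",)),
    ("ADDD", "F4", ("F0", "F2")),
    ("SD", None, ("F4", "R1")),
]
renamed = rename(body * 4)
destinations = [dest for _, dest, _ in renamed if dest is not None]
print(destinations)
```

With these inputs the destinations come out in the same order as the renamed registers on the slide, and each ADDD and SD reads the register written by the LD or ADDD of its own copy.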
2. Eliminating Control Dependences
Intermediate BEQZ are never taken (the trip count, 1000, is a multiple of 4): eliminate!
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, #8
      BNEZ R1, Loop
Exit:
3. Eliminating Data Dependences
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, #8
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, #8
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, #8
      BNEZ R1, Loop
• Data dependencies SUBI, LD, SD
– Force sequential execution of iterations
• Compiler removes this dependency by:
– Computing intermediate R1 values
– Eliminating intermediate SUBI
– Changing final SUBI
• Data flow analysis
– Can do on registers
– Cannot do easily on memory locations
– e.g., is 100(R1) = 20(R2)?
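The effect of folding the intermediate SUBI instructions into the load/store offsets can be checked numerically. The sketch below is only an address calculation; the function names and the starting value of R1 are assumptions chosen for illustration.

```python
def addresses_with_subi(r1, copies=4, step=8):
    """Addresses touched when each copy uses 0(R1) followed by SUBI R1, R1, #8."""
    addrs = []
    for _ in range(copies):
        addrs.append(r1 + 0)   # 0(R1)
        r1 -= step             # SUBI R1, R1, #8
    return addrs, r1


def addresses_folded(r1, copies=4, step=8):
    """Same accesses with the SUBIs folded into offsets and one final SUBI."""
    offsets = [-step * k for k in range(copies)]   # 0, -8, -16, -24
    addrs = [r1 + off for off in offsets]
    r1 -= step * copies                            # SUBI R1, R1, #32
    return addrs, r1, offsets


before_addrs, before_r1 = addresses_with_subi(8000)
after_addrs, after_r1, offsets = addresses_folded(8000)
print(offsets, before_addrs == after_addrs, before_r1 == after_r1)
```

Both versions touch the same four addresses and leave R1 with the same value, but in the folded version no load or store waits on an intermediate SUBI.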
4. Alleviating Data Dependencies
Unrolled loop:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, #32
      BNEZ R1, Loop

Scheduled unrolled loop:
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, #32
      SD   16(R1), F12
      BNEZ R1, Loop
      SD   8(R1), F16
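That the scheduled version preserves the semantics of the unrolled loop can be checked with a toy interpreter. This is a minimal sketch, assuming a hypothetical tuple encoding of the five opcodes, one pass through the loop body only, and a branch that is treated as a no-op so that the instruction after it (the delay slot) always executes.

```python
def run_once(body, mem, r1, s):
    """Execute one pass of a loop body and return (memory, final R1)."""
    regs = {"R1": r1, "F2": s}
    mem = dict(mem)
    for ins in body:
        op = ins[0]
        if op == "LD":                       # ("LD", fd, offset, base)
            _, fd, off, base = ins
            regs[fd] = mem[regs[base] + off]
        elif op == "ADDD":                   # ("ADDD", fd, fa, fb)
            _, fd, fa, fb = ins
            regs[fd] = regs[fa] + regs[fb]
        elif op == "SD":                     # ("SD", offset, base, fs)
            _, off, base, fs = ins
            mem[regs[base] + off] = regs[fs]
        elif op == "SUBI":                   # ("SUBI", rd, rs, imm)
            _, rd, rs, imm = ins
            regs[rd] = regs[rs] - imm
        elif op == "BNEZ":
            pass                             # single pass; delay slot follows
    return mem, regs["R1"]


pairs = [("F0", "F4", 0), ("F6", "F8", -8), ("F10", "F12", -16), ("F14", "F16", -24)]

unrolled = []
for ld, add, off in pairs:
    unrolled += [("LD", ld, off, "R1"), ("ADDD", add, ld, "F2"), ("SD", off, "R1", add)]
unrolled += [("SUBI", "R1", "R1", 32), ("BNEZ", "R1", "Loop")]

scheduled = (
    [("LD", ld, off, "R1") for ld, _, off in pairs]
    + [("ADDD", add, ld, "F2") for ld, add, _ in pairs]
    + [("SD", 0, "R1", "F4"), ("SD", -8, "R1", "F8"),
       ("SUBI", "R1", "R1", 32),
       ("SD", 16, "R1", "F12"),              # -16 + 32: R1 already decremented
       ("BNEZ", "R1", "Loop"),
       ("SD", 8, "R1", "F16")]               # -24 + 32, in the delay slot
)

memory = {8: 1, 16: 2, 24: 3, 32: 4}
result_unrolled = run_once(unrolled, memory, r1=32, s=10)
result_scheduled = run_once(scheduled, memory, r1=32, s=10)
print(result_unrolled == result_scheduled)
```

The two stores moved below SUBI need their offsets raised by 32, which is why the slide shows 16(R1) and 8(R1) instead of -16(R1) and -24(R1).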

Some General Comments
• Dependences are a property of programs
• Actual hazards are a property of the pipeline
• Techniques to avoid dependence limitations
– Maintain dependences but avoid hazards
• Code scheduling
– hardware
– software
– Eliminate dependences by code transformations
• Complex
• Compiler-based

Loop-level Parallelism
• Primary focus of dependence analysis
• Determine all dependences and find cycles
Example 1:
for (i=1; i<=100; i=i+1) {
    x[i] = y[i] + z[i];
    w[i] = x[i] + v[i];
}

Example 2:
for (i=1; i<=100; i=i+1) {
    x[i+1] = x[i] + z[i];
}

Example 3:
for (i=1; i<=100; i=i+1) {
    x[i] = x[i] + y[i];
    y[i+1] = w[i] + z[i];
}

Example 3, transformed:
x[1] = x[1] + y[1];
for (i=1; i<=99; i=i+1) {
    y[i+1] = w[i] + z[i];
    x[i+1] = x[i+1] + y[i+1];
}
y[101] = w[100] + z[100];
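The loop that updates x[i] and y[i+1] and its transformed version can be compared directly. The sketch below transcribes both into Python; the array contents are arbitrary test data and the function names are assumptions. Index 0 is unused so that the indices match the slide.

```python
def original(x, y, w, z):
    """x[i] reads y[i], which the previous iteration wrote: loop-carried."""
    x, y = list(x), list(y)
    for i in range(1, 101):
        x[i] = x[i] + y[i]
        y[i + 1] = w[i] + z[i]
    return x, y


def transformed(x, y, w, z):
    """Same computation with the dependence kept inside one iteration."""
    x, y = list(x), list(y)
    x[1] = x[1] + y[1]
    for i in range(1, 100):
        y[i + 1] = w[i] + z[i]
        x[i + 1] = x[i + 1] + y[i + 1]
    y[101] = w[100] + z[100]
    return x, y


n = 102                              # indices 0..101
x0 = [k for k in range(n)]
y0 = [2 * k for k in range(n)]
w0 = [3 * k for k in range(n)]
z0 = [k + 7 for k in range(n)]

same = original(x0, y0, w0, z0) == transformed(x0, y0, w0, z0)
print(same)
```

After the transformation, each iteration only uses the y value it has just produced, so different iterations no longer depend on one another.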
Dependence Analysis Algorithms
• Assume array indexes are affine (ai + b)

– GCD test:
For two affine array indexes ai+b and ci+d:
if a loop-carried dependence exists, then GCD(c,a) must
divide (d-b)
Example: x[8*i] = x[4*i + 2] + 3
(2-0)/GCD(8,4) = 2/4 is not an integer, so no dependence is possible
• General graph cycle determination is NP-complete
• a, b, c, and d may not be known at compile time
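The GCD test is short enough to write out. A minimal sketch, assuming the stored element has index a*i + b and the loaded element has index c*i + d; the function name is hypothetical. The test is conservative: a result of True only means a dependence cannot be ruled out.

```python
from math import gcd


def gcd_test_may_depend(a, b, c, d):
    """Return False when the GCD test proves there is no loop-carried dependence."""
    g = gcd(a, c)
    if g == 0:
        # Both indexes are constants: they collide only if they are equal.
        return b == d
    return (d - b) % g == 0


# x[8*i] = x[4*i + 2] + 3: a = 8, b = 0, c = 4, d = 2
slide_example = gcd_test_may_depend(8, 0, 4, 2)
# x[i+1] = x[i] + z[i]: a = 1, b = 1, c = 1, d = 0
recurrence_example = gcd_test_may_depend(1, 1, 1, 0)
print(slide_example, recurrence_example)
```

For the slide's example GCD(8,4) = 4 does not divide 2, so the test reports that no dependence is possible; for the recurrence x[i+1] = x[i] + z[i] it cannot rule one out.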

Software Pipelining
[Figure: Iteration 0, Iteration 1, Iteration 2, and Iteration 3 overlapped in time; a software pipelined iteration sits between the start-up and finish-up code]
Example
Iteration i         Iteration i+1       Iteration i+2
LD   F0, 0(R1)
ADDD F4, F0, F2     LD   F0, 0(R1)
SD   0(R1), F4      ADDD F4, F0, F2     LD   F0, 0(R1)
                    SD   0(R1), F4      ADDD F4, F0, F2
                                        SD   0(R1), F4

Original loop:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop

Software pipelined loop:
Loop: SD   16(R1), F4
      ADDD F4, F0, F2
      LD   F0, 0(R1)
      SUBI R1, R1, #8
      BNEZ R1, Loop
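The start-up and finish-up code that the loop body omits can be made explicit in a high-level sketch. This is an illustration under assumptions: Python variables f0 and f4 stand in for the registers, the array is walked from the last index down as in the slide's loop, and the function names are hypothetical.

```python
def add_scalar(x, s):
    """Reference loop: x[i] = x[i] + s for every element."""
    out = list(x)
    for i in range(len(out) - 1, -1, -1):
        out[i] = out[i] + s
    return out


def add_scalar_software_pipelined(x, s):
    """Same result, with each pass mixing work from three original iterations."""
    out = list(x)
    n = len(out)
    assert n >= 2, "the sketch needs at least two elements"
    # Start-up: LD and ADDD of the first iteration, LD of the second.
    f0 = out[n - 1]
    f4 = f0 + s
    f0 = out[n - 2]
    # Steady state: store for iteration k+2, add for k+1, load for k.
    for k in range(n - 3, -1, -1):
        out[k + 2] = f4      # SD   16(R1), F4
        f4 = f0 + s          # ADDD F4, F0, F2
        f0 = out[k]          # LD   F0, 0(R1)
    # Finish-up: drain the two iterations still in flight.
    out[1] = f4
    f4 = f0 + s
    out[0] = f4
    return out


data = [1, 2, 3, 4, 5, 6, 7]
pipelined = add_scalar_software_pipelined(data, 10)
print(pipelined)
```

Inside the steady-state loop the store, the add, and the load belong to different original iterations, so none of them has to wait for a result produced in the same pass.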
Trace (global-code)
Scheduling
• Find ILP across conditional branches
• Two-step process
– Trace selection
• Find a trace (sequence of basic blocks)
• Use loop unrolling to generate long traces
• Use static branch prediction for other conditional
branches
– Trace compaction
• Squeeze the trace into a small number of wide
instructions
• Preserve data and control dependences
Trace Selection
Flowchart:
A[I] = A[I] + B[I]
A[I] = 0?
T: B[I] =
F: X
C[I] =

Code:
      LW   R4, 0(R1)
      LW   R5, 0(R2)
      ADD  R4, R4, R5
      SW   0(R1), R4
      BNEZ R4, else
      ....
      SW   0(R2), . . .
      J    join
Else: ....
      X
Join: ....
      SW   0(R3), . . .
Summary of Compiler
Techniques
• Try to avoid dependence stalls
• Loop unrolling
– Reduce loop overhead
• Software pipelining
– Reduce single body dependence stalls
• Trace scheduling
– Reduce impact of other branches
• Compilers use a mix of all three
• All techniques depend on prediction
accuracy
CSE 4821 Advanced Computer Architecture
Thread Level Parallelism
Dr. M. A. Rouf
Dept. of CSE
DUET, Gazipur
Performance beyond single thread ILP
• There can be much higher natural parallelism in
some applications
(e.g., Database or Scientific codes)
• Explicit Thread Level Parallelism or Data Level
Parallelism
• Thread: process with own instructions and data
• A thread may be a process that is part of a parallel program of multiple
processes, or it may be an independent program
• Each thread has all the state (instructions, data, PC, register state, and
so on) necessary to allow it to execute
• Data Level Parallelism: Perform identical operations
on data, and lots of data
CSE-4821 Dr. M. A. Rouf, Dept. of CSE, DUET, Gazipur
Thread Level Parallelism (TLP)
• ILP exploits implicit parallel operations within a
loop or straight-line code segment
• TLP explicitly represented by the use of multiple
threads of execution that are inherently parallel
• Goal: Use multiple instruction streams to
improve
1. Throughput of computers that run many programs
2. Execution time of multi-threaded programs
• TLP could be more cost-effective to exploit than
ILP

New Approach: Multithreaded Execution
• Multithreading: multiple threads to share the functional
units of 1 processor via overlapping
• processor must duplicate independent state of each thread e.g., a separate
copy of register file, a separate PC, and for running independent programs, a
separate page table
• memory shared through the virtual memory mechanisms, which already
support multiple processes
• HW for fast thread switch; much faster than a full process switch, which takes 100s to
1000s of clock cycles
• When to switch?
• Alternate instruction per thread (fine grain) in each cycle
• When a thread is stalled, perhaps for a cache miss, another thread can be
executed (coarse grain)

Fine-Grained Multithreading
• Switches between threads on each instruction, causing the execution
of multiple threads to be interleaved
• Usually done in a round-robin fashion, skipping any stalled threads
• CPU must be able to switch threads every clock
• Advantage: it can hide both short and long stalls, since
instructions from other threads are executed when one thread stalls
• Disadvantage: it slows down execution of individual threads, since a
thread ready to execute without stalls will be delayed by instructions
from other threads
• Used on Sun’s Niagara (will see later)
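The round-robin policy with skipping of stalled threads can be sketched as a tiny scheduler. This is a simplified model, not a description of any real processor: stalls are given up front as sets of cycle numbers per thread (a hypothetical input format), and one instruction issues per cycle.

```python
def fine_grained_schedule(stalled, cycles):
    """Pick one thread per cycle, round-robin, skipping stalled threads.

    stalled[t] is the set of cycles in which thread t cannot issue.
    Returns a list with the chosen thread per cycle, or None for an idle cycle.
    """
    n = len(stalled)
    schedule = []
    next_thread = 0
    for cycle in range(cycles):
        chosen = None
        for k in range(n):
            t = (next_thread + k) % n
            if cycle not in stalled[t]:
                chosen = t
                break
        schedule.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % n   # continue after the issuing thread
    return schedule


# Three threads; thread 1 is stalled during cycles 1 to 3 (made-up pattern).
example_stalls = [set(), {1, 2, 3}, set()]
example_schedule = fine_grained_schedule(example_stalls, cycles=6)
print(example_schedule)
```

While thread 1 is stalled the other two threads fill every cycle, which is the stall-hiding advantage; thread 0 nevertheless issues only every second or third cycle, which is the single-thread slowdown listed as the disadvantage.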

Coarse-Grained Multithreading
• Switches threads only on costly stalls, such as L2 cache misses
• Advantages
• Relieves need to have very fast thread-switching
• Doesn’t slow down thread, since instructions from other threads
issued only when the thread encounters a costly stall
• Disadvantage: it is hard to overcome throughput losses from shorter
stalls, due to pipeline start-up costs
• Since CPU issues instructions from 1 thread, when a stall occurs, the
pipeline must be emptied or frozen
• New thread must fill pipeline before instructions can complete
• Because of this start-up overhead, coarse-grained multithreading is
better for reducing penalty of high cost stalls, where pipeline refill <<
stall time
• Used in IBM AS/400

For most apps, most execution units lie idle
For an 8-way superscalar.
From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism,” ISCA 1995.
Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel
structure in a program
• Could a processor oriented at ILP also exploit TLP?
• functional units are often idle in data path designed for ILP because
of either stalls or dependences in the code
• Could the TLP be used as a source of independent
instructions that might keep the processor busy
during stalls?
• Could TLP be used to employ the functional units
that would otherwise lie idle when insufficient ILP
exists?

Simultaneous Multi-threading ...
[Figure: issue-slot usage over cycles 1 to 9 on 8 units (M M FX FX FP FP BR CC), shown for "One thread, 8 units" and for "Two threads, 8 units"]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
Simultaneous Multithreading (SMT)
• Simultaneous multithreading (SMT): insight that dynamically
scheduled processor already has many HW mechanisms to support
multithreading
• Large set of virtual registers that can be used to hold the register sets of
independent threads
• Register renaming provides unique register identifiers, so instructions from
multiple threads can be mixed in datapath without confusing sources and
destinations across threads
• Out-of-order completion allows the threads to execute out of order, and
get better utilization of the HW
• Just adding a per thread renaming table and keeping separate PCs
• Independent commitment can be supported by logically keeping a separate
reorder buffer for each thread
Source: Microprocessor Report, December 6, 1999, “Compaq Chooses SMT for Alpha”
Multithreading Categories
[Figure: issue slots of FUs/pipes 1 to 4 over time (processor cycles); legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot]
Issue-slot utilization:
• Superscalar: 16/48 = 33.3%
• Fine-Grained (new thread/cycle): 27/48 = 56.3%
• Coarse-Grained (many cycles/thread): 27/48 = 56.3%
• Multiprocessing (separate jobs): 29/48 = 60.4%
• Simultaneous Multithreading: 42/48 = 87.5%
Design Challenges in SMT
• Since SMT makes sense only with a fine-grained implementation,
what is the impact of fine-grained scheduling on single-thread performance?
• A preferred thread approach sacrifices neither throughput nor single-thread
performance?
• Unfortunately, with a preferred thread, the processor is likely to sacrifice some
throughput, when preferred thread stalls
• Larger register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
• Instruction issue - more candidate instructions need to be considered
• Instruction completion - choosing which instructions to commit may be
challenging
• Ensuring that cache and TLB conflicts generated by SMT do not
degrade performance