1. Instruction Level Parallelism –
Compiler Techniques
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by
John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan
Kaufmann Publishers
3. Source: www.cse.wustl.edu/~jain/cse567-11/ftp/multcore/
Instruction Level Parallelism
Overlap execution of instructions to improve
performance
2 approaches
1. Rely on software techniques to find parallelism,
statically at compile-time
E.g., Itanium 2, ARM Cortex-A9
2. Rely on hardware to help discover & exploit
parallelism dynamically
E.g., Pentium 4, AMD Opteron, IBM Power
3
5. Instruction Level Parallelism
Basic Block (BB) ILP is quite small
BB – straight-line code sequence with no branches in
except to entry & no branches out except at exit
Average dynamic branch frequency 15% to 25%
4 to 7 instructions execute between a pair of branches
Also, instructions in BB likely to depend on each other
5
6. Loop Level Parallelism
for(i = 1; i <= 1000; i = i+1)
x[i] = x[i] + y[i];
Each iteration is independent
Exploit loop-level parallelism to parallelize by
“unrolling loop” either by
1. Static via loop unrolling by compiler
Another way is vectors & GPUs, to be covered later
2. Dynamic via branch prediction
Determining instruction dependence is critical to
loop-level parallelism
6
8. Data Dependences
Instruction J tries to read operand before Instruction I
writes it
Caused by a dependence
This hazard results from an actual need for
communication – True dependence
Instruction K depends on instruction J, & J depends on I
K J I
Is add r1, r1, r2 a dependence?
8
9. Data Dependences (Cont.)
Program order must be preserved
Dependences are a property of programs
Data dependencies
Indicates potential for a hazard
But actual hazard & length of any stall is a property of the
pipeline
Determines order in which results must be calculated
Sets an upper bound on how much parallelism can
possibly be exploited
Goal
Exploit parallelism by preserving program order only
where it affects outcome of the program 9
10. Name Dependences
When 2 instructions use same register or
memory location (called a name), but no flow of
data between instructions associated with that
name
Because of use of same name
10
11. 2 Types of Name Dependence
1. Antidependence
Instruction J writes operand before instruction I reads
it
Problem caused by use of name r1
Write After Read (WAR) hazard
Not a problem in MIPS 5-stage pipeline
11
12. 2 Types of Name Dependence (Cont.)
2. Output dependence
Instruction J writes operand before instruction I writes
it
Problem caused by reuse of name r1
Write After Write (WAW) hazard
Not a problem in MIPS 5-stage pipeline
12
13. Name Dependences Solution –
Register Renaming
Rename registers either by compiler or hardware
How to overcome these?
13
14. Control Dependencies
if p1 {
…
S1;
…
};
if p2 {
…
S2;
…
}
S1 is control dependent on p1, & S2 is control
dependent on p2 but not on p1
Instructions that are control dependent can’t be
moved before or after the branch 14
15. Control Dependencies (Cont.)
Control dependence need not be preserved
Execute instructions that shouldn’t have been
executed, thereby violating control dependences
If can do so without affecting correctness of program
2 properties critical to program correctness
1. Data flow
2. Exception behavior
15
16. Data Flow
Actual flow of data among instructions that
produce results & those that consume them
Branches make flow dynamic, determine which
instruction is supplier of data
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R5,R6
L:
…
OR R7,R1,R8
OR depends on DADDU or DSUBU?
16
17. Exception Behaviour
Any changes in instruction execution order
mustn’t change how exceptions are raised in
program no new exceptions
DADDU R2,R3,R4
BEQZ R2,L1
LW R1,0(R2)
L1:
....
Can we move LW before BEQZ?
No data dependence
Control dependence? 17
18. Compiler Techniques
Consider following code
for (I = 999; I >= 0; I = i-1)
x[i] = x[i] + s;
Consider following latencies
Write program in Assembly
To simplify, assume 8 is lowest address 18
19. Compiler Techniques (Cont.)
Loop: LD F0,0(R1)
ADDD F4,F0,F2
SD F4,0(R1)
DADDUI R1,R1,#-8
BNE R1,R2,Loop
R – integer registers
F – floating point registers
R1 – highest address of array
F2 – s
F4 – result of computation
DADDUI – decrement pointer by
8 bytes
BNZ – branch not equal
Loop: LD F0,0(R1)
stall
ADDD F4,F0,F2
stall
stall
SD F4,0(R1)
DADDUI R1,R1,#-8
stall (assume int load latency 1)
BNE R1,R2,Loop
19
Need 9 clock cycles
20. Revised Code – Pipeline Scheduling
Loop: LD F0,0(R1)
DADDUI R1,R1,#-8
ADDD F4,F0,F2
stall
stall
SD F4,8(R1)
BNE R1,R2,Loop
Need 7 clock cycles
3 for execution (LD, ADDD,SD)
4 for loop overhead
How to make it even faster? 20
21. Solution – Loop Unrolling
Unroll by a factor of 4 – Assume no of elements is divisible by 4
Eliminate unnecessary instructions
Loop: LD F0,0(R1)
ADDD F4,F0,F2
SD F4,0(R1) ;drop DADDUI & BNE
LD F6,-8(R1)
ADDD F8,F6,F2
SD F8,-8(R1) ;drop DADDUI & BNE
LD F10,-16(R1)
ADDD F12,F10,F2
SD F12,-16(R1) ;drop DADDUI & BNE
LD F14,-24(R1)
ADDD F16,F14,F2
SD F16,-24(R1)
DADDUI R1,R1,#-32 ; 4 x 8
BNE R1,R2,Loop 21
1 cycle stall
2 cycle stall
1 cycle stall
27 clock cycles, 6.75 per iteration
23. Loop Unrolling
Usually don’t know upper bound of loop
Suppose it is n
Unroll loop to make k copies of the body
Generate a pair of consecutive loops
1st executes n mod k times & has a body that is the
original loop
2nd unrolled body surrounded by an outer loop that
iterates n/k times
For large n, most of the execution time spent in
unrolled loop
23
24. Conditions for Unrolling Loops
Loop unrolling requires understanding of
How 1 instruction depends on another
How instructions can be changed or reordered given
dependences
5 loop unrolling decisions
1. Determine loop unrolling useful by finding that loop
iterations were independent
2. Use different registers to avoid unnecessary
constraints forced by using same registers for
different computations
24
25. Conditions for Unrolling Loops (Cont.)
3. Eliminate extra test & branch instructions & adjust
loop termination & iteration code
4. Determine that loads & stores in unrolled loop can
be interchanged by observing that loads & stores
from different iterations are independent
Transformation requires analyzing memory addresses &
finding that they don’t refer to the same address
5. Schedule code, preserving any dependences
needed to yield the same result as original code
25
26. Limits of Loop Unrolling
Decrease in amount of overhead amortized with
each extra unrolling
Amdahl’s Law
Growth in code size
Larger loops increases the instruction cache miss rate
Register pressure
Potential shortfall in registers created by aggressive
unrolling & scheduling
Loop unrolling reduces impact of branches on
pipeline; another way is branch prediction
26
Editor's Notes
Here we don’t consider structural hazards
BEQZ – branch if equal to 0
What if memory address is illegal?