Exploiting ILP With Software Approach
4.1 Basic compiler techniques for exposing ILP
Basic pipeline scheduling and loop unrolling
Fig 1 latencies of FP operations used in this chapter
(ref figure 1)
• Clock cycle
   1  Loop: L.D    F0,0(R1)    ;F0 = vector element x[i]
   2        stall
   3        ADD.D  F4,F0,F2    ;add scalar s in F2
   4        stall
   5        stall
   6        S.D    F4,0(R1)    ;store result
   7        DADDUI R1,R1,#-8   ;decrement pointer by 8 bytes (one DW)
   8        stall
   9        BNE    R1,R2,Loop  ;branch if R1 != R2
  10        stall
4
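For reference, a minimal C sketch of the source loop the assembly above implements (the elements are doubles and s is the scalar held in F2; the exact pointer setup in R1/R2 is assumed here rather than taken from the slide):

    void add_scalar(double *x, long n, double s) {
        /* the assembly walks the array downward, decrementing the pointer by 8 bytes per element */
        for (long i = n - 1; i >= 0; i--)
            x[i] = x[i] + s;
    }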
4.1 Basic compiler techniques for exposing ILP
Basic pipeline scheduling and loop unrolling
Scheduling the previous example leaves only one stall:
   1  Loop: L.D    F0,0(R1)
   2        DADDUI R1,R1,#-8
   3        ADD.D  F4,F0,F2
   4        stall
   5        BNE    R1,R2,Loop  ;branch if R1 != R2
   6        S.D    F4,8(R1)    ;offset altered to 8; S.D interchanged with DADDUI
5
4.2 Static branch prediction
LD R1,0(R2)
DSUBU R1,R1,R3
BEQZ R1,L1
OR R4,R5,R6
DADDU R10,R4,R3
L1: DADDU R7,R8,R9
• The dep of the DSUBU and BEQZ on the LD inst
  means that a stall will be needed after the LD.
• Suppose the branch is almost always taken and
  R7 is not needed on the fall-through path
  – Could increase the speed by moving the DADDU at L1 to the
    position after the LD.
• Suppose the branch is rarely taken and R4 is
  not needed on the taken path
  – Could increase the speed by moving the OR to the
    position after the LD.
13
4.2 Static branch prediction
Fig 3 Misprediction rate on SPEC92 for a profile-based predictor varies
widely but is generally better for the FP programs, which have an
average misprediction rate of 9% with a standard deviation of 4%,
than for the INT programs, which have an average misprediction
rate of 15% with a standard deviation of 5%
14
4.2 Static branch prediction
Fig 4 Accuracy of a predicted-taken strategy and a profile-
based predictor for SPEC92 benchmarks as measured
by the number of insts executed between mispredicted
branches and shown on a log scale
15
4.3 Static multiple issue: the VLIW
approach
• Very long instruction word
– 64 to 128 bits, or longer
– Early VLIWs were quite rigid in their formats
and effectively required recompilation for
different versions of the H/W
– To reduce this inflexibility and enhance the
performance of the approach, several
innovations have been incorporated into more
recent architectures of this type
16
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
• VLIW approaches make sense for wider
processors
– Eg. a VLIW processor might have insts that contain 5
operations, including 1 int operation (which could also
be a branch), 2 fp operations, and 2 memory
references
• The VLIW instruction
  – Contains one field per functional unit
  – Perhaps 16 to 24 bits per unit
  – Yielding an inst length of between 112 and 168 bits
17
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
Example suppose we have a VLIW that
could issue 2 mem refs, 2 fp operations,
and 1 int operation or branch in every
clock cycle. Show an unrolled version of
the loop x[i]=x[i]+s for such a processor.
Unroll as many times as necessary to
eliminate any stalls. Ignore the branch
delay slot.
18
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
Fig 5 VLIW insts that occupy the inner loop and replace the unrolled
sequence
[Table layout: columns are Memory reference 1, Memory reference 2, FP operation 1, FP operation 2, and Integer operation/branch. The recoverable rows show the unrolled loop's L.D, ADD.D, and S.D operations packed into these slots, with DADDUI R1,R1,#-56 adjusting the pointer; the full figure is not reproduced here.]
22
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
• The potential advantages of a multiple-
issue processor versus a vector processor
are twofold
– 1. a multiple-issue processor has the potential
to extract some amount of parallelism from
less regularly structured code
– 2. It has the ability to use a more
conventional, and typically less expensive,
cache-based mem system
23
4.4 Advanced compiler support
for exposing and exploiting ILP
• Coverage
– Detecting and enhancing loop-level
parallelism
– Software pipeline: symbolic loop unrolling
– Global code scheduling
24
Detecting and Enhancing Loop-Level
Parallelism
• Ex 1)
for (i=1; i<=100; i++) {
A[i+1] = A[i] + C[i]; // S1
B[i+1] = B[i] + A[i+1]; // S2
}
– Dependence
A[2] = A[1] + C[1]; // S1 (i=1)
B[2] = B[1] + A[2]; // S2 (i=1)
A[3] = A[2] + C[2]; // S1 (i=2)
B[3] = B[2] + A[3]; // S2 (i=2)
A[4] = A[3] + C[3]; // S1 (i=3)
B[4] = B[3] + A[4]; // S2 (i=3)
……
• S1 → S1: loop-carried dependence
• S2 → S2: loop-carried dependence
• S1 → S2: dependence within the same iteration
27
Detecting and Enhancing Loop-Level
Parallelism
• Ex 2
for (i=1; i<=100; i++) {
A[i] = A[i] + B[i]; // S1
B[i+1] = C[i] + D[i]; // S2
}
– Dependence
A[1] = A[1] + B[1]; // S1 (i=1)
B[2] = C[1] + D[1]; // S2 (i=1)
A[2] = A[2] + B[2]; // S1 (i=2)
B[3] = C[2] + D[2]; // S2 (i=2)
A[3] = A[3] + B[3]; // S1 (i=3)
B[4] = C[3] + D[3]; // S2 (i=3)
……
• S2 → S1: loop-carried dependence
  – The loop is loop-level parallel because the dependence is not circular
28
Detecting and Enhancing Loop-Level
Parallelism
• Ex 3)
A[1] = A[1] + B[1]
for (i=1; i<=99; i++) {
B[i+1] = C[i] + D[i]; // S2
A[i+1] = A[i+1] + B[i+1]; // S1
}
B[101] = C[100] +D[100]
– Dependence
A[1] = A[1] + B[1]; // S1
B[2] = C[1] + D[1]; // S2 (i=1)
A[2] = A[2] + B[2]; // S1 (i=1)
B[3] = C[2] + D[2]; // S2 (i=2)
A[3] = A[3] + B[3]; // S1 (i=2)
B[4] = C[3] + D[3]; // S2 (i=3)
……
– After this transformation, S1 depends only on the S2 of the same iteration; no dependence is carried across iterations, so the loop iterations can be overlapped.
29
Detecting and Enhancing Loop-Level
Parallelism
• Another example)
for (i=1; i<=100; i++) {
A[i] = A[i] + B[i]; // S1
B[i+1] = C[i] + A[i]; // S2
}
– Dependence
A[1] = A[1] + B[1]; // S1 (i=1)
B[2] = C[1] + A[1]; // S2 (i=1)
A[2] = A[2] + B[2]; // S1 (i=2)
B[3] = C[2] + A[2]; // S2 (i=2)
A[3] = A[3] + B[3]; // S1 (i=3)
B[4] = C[3] + A[3]; // S2 (i=3)
……
These dependences are circular:
• S2 → S1: loop-carried dependence
• S1 → S2: dependence within the same iteration
so this loop is not parallel.
30
Detecting and Enhancing Loop-Level
Parallelism
• Recurrence (a kind of loop-carried dependence)
– Variable defined based on the value in an earlier iteration
– Example)
for (i=2; i<=100; i=i+1)
Y[i] = Y[i-1] + Y[i];
– Dependence distance in recurrence
– Example)
for (i=6; i<=100; i=i+1)
Y[i] = Y[i-5] + Y[i];
• Distance: 5
– More parallelism with larger distance
31
Detecting and Enhancing Loop-Level
Parallelism
• Recurrence (continued)
– Example)
for (i=6; i<=100; i=i+1)
Y[i] = Y[i-5] + Y[i];
• Distance: 5
// Each group of 5 consecutive iterations is mutually independent,
// so parallel execution within a group is possible:
Y[6]  = Y[1]  + Y[6];   // i = 6
Y[7]  = Y[2]  + Y[7];   // i = 7
Y[8]  = Y[3]  + Y[8];   // i = 8
Y[9]  = Y[4]  + Y[9];   // i = 9
Y[10] = Y[5]  + Y[10];  // i = 10

Y[11] = Y[6]  + Y[11];  // i = 11
Y[12] = Y[7]  + Y[12];  // i = 12
Y[13] = Y[8]  + Y[13];  // i = 13
Y[14] = Y[9]  + Y[14];  // i = 14
Y[15] = Y[10] + Y[15];  // i = 15

Y[16] = Y[11] + Y[16];  // i = 16
32
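As a sketch of why a distance of 5 exposes parallelism: the iterations split into 5 independent chains (grouped by i mod 5), which can run in parallel or be scheduled together. The function below is an illustrative C rewrite of the slide's loop, not taken from the text:

    /* Computes Y[i] = Y[i-5] + Y[i] for i = 6..n, equivalent to the original loop:
       each chain touches only elements whose indices share the same residue mod 5,
       so the 5 chains are mutually independent. */
    void recurrence_by_chains(double *Y, int n) {
        for (int chain = 0; chain < 5; chain++)
            for (int i = 6 + chain; i <= n; i += 5)
                Y[i] = Y[i - 5] + Y[i];
    }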
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
The larger the distance, the more potential
parallelism can be obtained by unrolling the loop
• Finding the dependences is important in 3
tasks
– Good scheduling of code
– Determining which loops might contain
parallelism
– Eliminating name dependences
33
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
Assume the loop index runs from m to n, the loop stores into an array
element indexed by a×j + b, and it later fetches from an element
indexed by c×k + d.
• A dependence exists if
  – There are two iteration indices, j and k, both
    within the limits of the for loop,
    that is, m ≤ j ≤ n and m ≤ k ≤ n, and
  – The loop stores into an array element indexed
    by a×j + b and later fetches from that same
    array element when it is indexed by c×k + d,
    that is, a×j + b = c×k + d
34
1. Dependence analysis is difficult for arrays and pointers.
2. GCD (greatest common divisor) test:
   - a simple test that is sufficient to guarantee that no
     dependence exists.
   (i)  If a loop-carried dependence exists, then GCD(c,a) must
        divide (d-b).
   (ii) If GCD(c,a) does not divide (d-b), no loop-carried
        dependence is possible. (Passing the test does not by itself
        prove that a dependence exists.)
35
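A small C sketch of the GCD test as stated above (the function names are illustrative; as noted, the test can only rule a dependence out, not prove one):

    #include <stdio.h>

    /* greatest common divisor by Euclid's algorithm */
    static int gcd(int a, int b) {
        if (a < 0) a = -a;
        if (b < 0) b = -b;
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* For a loop that stores X[a*j + b] and later loads X[c*k + d]:
       if gcd(c, a) does not divide (d - b), no loop-carried dependence exists.
       Returns 1 when a dependence is still possible, 0 when it is ruled out. */
    static int dependence_possible(int a, int b, int c, int d) {
        int g = gcd(c, a);
        if (g == 0)                 /* both coefficients zero: fixed indices b and d */
            return b == d;
        return (d - b) % g == 0;
    }

    int main(void) {
        /* e.g. a store indexed by 2*i+3 and a load indexed by 2*i:
           gcd(2,2) = 2 does not divide -3, so no dependence (prints 0) */
        printf("%d\n", dependence_possible(2, 3, 2, 0));
        return 0;
    }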
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
Example The following loop has multiple types of
dependences. Find all the true dependences
and antidependences by renaming.
for (i=1; i<=100; i=i+1) {
Y[i]=X[i] / c; /* S1 */
X[i]=X[i] + c; /* S2 */
Z[i]=Y[i] + c; /* S3 */
Y[i]=c – Y[i]; /* S4 */ }
Answer
1. Y[i]: true dep from S1 to S3 and from S1 to S4;
   not loop-carried
2. X[i]: antidep from S1 to S2
3. Y[i]: antidep from S3 to S4
4. Y[i]: output dep from S1 to S4
36
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
The following renamed version eliminates the false dependences:
for (i=1; i<=100; i=i+1) {
/* Y renamed to T to remove output dep */
T[i]=X[i] / c; /* S1 */
/* X renamed to X1 to remove antidep */
X1[i]=X[i] + c; /* S2 */
/* Y renamed to T to remove antidep */
Z[i]=T[i] + c; /* S3 */
Y[i]=c – Y[i]; /* S4 */ }
After the loop, the variable X has been renamed X1. In
code that follows the loop, the compiler can simply
replace the name X by X1.
37
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
• Situations in which array-oriented dep analysis
cannot tell us what we might want to know:
– When objects are referenced via pointers
– When array indexing is indirect through another array
– When a dep may exist for some value of inputs, but
does not exist in actuality when the code is run since
the inputs never take on those values
– When an optimization depends on knowing more than
  just the possibility of a dep, and needs to know exactly
  which write of a variable a given read of that variable
  depends on
38
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
• The basic approach used in points-to
analysis relies on information from
– 1. Type information, which restricts what a
pointer can point to
– 2. Information derived when an object is
allocated or when the address of an object is
taken, which can be used to restrict what a
pointer can point to
– 3. Information derived from pointer
assignment.
39
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
• Several cases where analyzing pointers has
been successfully applied and extremely useful:
– When pointers are used to pass the addr of an object
as a parameter, it is possible to use points-to analysis
to determine the possible set of objects referenced by
a pointer
– When a pointer can point to one of several types, it is
sometimes possible to determine the type of the data
object that a pointer designates at different parts of
the program
– It is often possible to separate out pointers that may
only point to a local object versus a global one
40
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
• Eliminating dependent computations
  – Copy propagation
    • Eliminates operations that copy values; for example,
          DADDUI R1,R2,#4
          DADDUI R1,R1,#4
      becomes
          DADDUI R1,R2,#8
  – Tree height reduction
    • Makes the expression tree wider but shorter; for example, the chain
          ADD R1,R2,R3
          ADD R4,R1,R6
          ADD R8,R4,R7
      has 3 dependent adds, while the rewritten tree
          ADD R1,R2,R3
          ADD R4,R6,R7
          ADD R8,R1,R4
      needs only 2 execution cycles, because the first two adds are independent.
41
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
• Eliminating dependent computations (continued)
  – Recurrences
    • A recurrence depends on the previous iteration, e.g.
          sum = sum + x;
      Unrolling 5 times gives
          sum = sum + x1 + x2 + x3 + x4 + x5;
      which, evaluated left to right, is still 5 dependent operations.
      Rewriting it as
          sum = ((sum + x1) + (x2 + x3)) + (x4 + x5);
      leaves only 3 dependent operations on the critical path.
42
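The same rewrite expressed as C, as a minimal sketch (note that reassociating floating-point adds changes rounding, so a compiler may apply it only when allowed):

    /* serial form: every add depends on the previous one (5 dependent operations) */
    double sum_serial(double sum, double x1, double x2, double x3, double x4, double x5) {
        return ((((sum + x1) + x2) + x3) + x4) + x5;
    }

    /* reassociated form: (x2+x3) and (x4+x5) are independent, leaving a
       critical path of only 3 dependent operations */
    double sum_reassociated(double sum, double x1, double x2, double x3, double x4, double x5) {
        return ((sum + x1) + (x2 + x3)) + (x4 + x5);
    }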
4.4 Advanced compiler support for exposing and exploiting ILP
Software pipeline: symbolic loop
unrolling
• Software pipeline
– Reorganize loops such that each iteration in the
software-pipelined code is made from instructions
chosen from different iterations of the original loop
Fig 6 a software-pipelined loop chooses insts from different loop
iterations, separating the dep insts within one iteration of the original
loop
43
4.4 Advanced compiler support for exposing and exploiting ILP
Software pipeline: symbolic loop
unrolling
Example Show a software-pipelined version of
this loop, which increments all the elements of
an array whose starting address is in R1 by the
contents of F2:
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
DADDUI R1,R1,#-8
BNE R1,R2,Loop
You may omit the start-up and clean-up code.
44
Software Pipelining: Symbolic Loop Unrolling
• Example)
  Original loop:
      Loop: L.D    F0,0(R1)
            ADD.D  F4,F0,F2
            S.D    F4,0(R1)
            DADDUI R1,R1,#-8
            BNE    R1,R2,Loop
  – Unrolled view (the software-pipelined body takes its instructions
    from three different iterations):
      Iteration i:   L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)
      Iteration i+1: L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)
      Iteration i+2: L.D F0,0(R1)   ADD.D F4,F0,F2   S.D F4,0(R1)
  – Software-pipelined loop:
            L.D    F0,0(R1)     ; start-up code
            ADD.D  F4,F0,F2     ; start-up code
            DADDUI R1,R1,#-8
            L.D    F0,0(R1)
            DADDUI R1,R1,#-8
      Loop: S.D    F4,16(R1)    ; stores into M[i]
            ADD.D  F4,F0,F2     ; adds to M[i-1]
            L.D    F0,0(R1)     ; loads M[i-2]
            DADDUI R1,R1,#-8
            BNE    R1,R2,Loop
            ……                  ; finish-up code
45
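A C-level sketch of the same software-pipelined schedule (upward-counting indices for readability; assumes at least two elements), mirroring the start-up code, steady-state body, and finish-up code above:

    void add_scalar_pipelined(double *x, long n, double s) {
        long i;
        /* start-up: begin iterations 0 and 1 */
        double loaded = x[0];          /* iteration 0: load */
        double added  = loaded + s;    /* iteration 0: add  */
        loaded = x[1];                 /* iteration 1: load */
        /* steady state: each pass stores for iteration i-2, adds for i-1, loads for i */
        for (i = 2; i < n; i++) {
            x[i - 2] = added;
            added    = loaded + s;
            loaded   = x[i];
        }
        /* finish-up: complete iterations n-2 and n-1 */
        x[n - 2] = added;
        x[n - 1] = loaded + s;
    }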
4.4 Advanced compiler support for exposing and exploiting ILP
Software pipeline: symbolic loop
unrolling
Fig 7 The execution pattern for (a) a software-pipelined
loop and (b) an unrolled loop
46
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
• Local scheduling techniques
  – Loop unrolling generates straight-line code, and local scheduling
    works well when the loop body is straight-line code, where it is
    easier to find a repeatable schedule
  – Operate on a single basic block
• Global scheduling techniques
  – Schedule code across branches (needed when there is internal
    control flow, e.g., the inner loop contains conditional branches)
  – More complex in structure
  – Must deal with significantly more complicated trade-offs in
    optimization
49
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
• Global code scheduling
– Effective scheduling of a loop body with internal control
flow will require moving insts across branches
– Aims to compact a code fragment with internal control
structure into the shortest possible sequence that
preserves the data and control dependence
– It can reduce the effect of control dependences arising
from conditional nonloop branches by moving code
– Does not guarantee faster code.
– Effectively using global code motion requires estimates
of the relative frequency of different paths
50
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
Fig 8 A code fragment and the common path shaded with
gray
LD R4,0(R1) ;load A
LD R5,0(R2) ;load B
DADDU R4,R4,R5 ;add to A
SD R4,0(R1) ;store A
…
BNEZ R4,elsepart ;test A
… ;then part
SD …,0(R2) ;store to B
…
J join ;jump over else
elsepart: … ;else part
X ;code for X
…
join: … ;after if
SD …,0(R3) ;store to C
51
CONSTRAINTS on moving code across branches
• The data flow must not be affected
• Exception behavior must be preserved (in particular for memory references)
52
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
What are the relative execution frequencies
of the then case and the else case in the
branch? If the then case is much more
frequent, the code motion may be beneficial.
54
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
• Two steps in trace scheduling
  – Trace selection
    tries to find a likely sequence of basic blocks
    whose operations will be put together into a
    small number of insts (a trace)
  – Trace compaction (once the trace is selected)
    tries to squeeze the trace into a small number
    of wide insts
55
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
Fig 9 This trace is obtained by assuming that the program fragment in Fig 8 is
the inner loop and unwinding it four times, treating the shaded portion in Fig
8 as the likely path
56
TRACE SCHEDULING
DISADVANTAGE:
Entries into and exits from the middle of the trace require
compensation code, which complicates compilation.
57
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
• Superblocks
– are formed by a process similar to that used
for traces
– but are a form of extended basic block, restricted
  to a single entry point while allowing multiple exits
  (so compaction becomes easier, since only code
  motion across an exit needs to be considered)
58
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
Fig 10 This superblock results from unrolling the code in Fig 8 four
times and creating a superblock
59
Hardware support for exposing
more parallelism
• Techniques such as loop unrolling, software pipelining, and
  global/trace scheduling improve ILP only when branch behavior
  is predictable at compile time.
60
4.5 Hardware support for exposing
more parallelism at compile time
• Conditional or predicated instructions
– An inst refers to a condition, which is
evaluated as part of the inst execution
– If the condition is true, the inst is executed
normally
– If the condition is false, the execution
continues as if the inst were no-op
61
4.5 Hardware support for exposing more parallelism at compile time
Conditional or predicated
instructions
Example Consider the following code:
if (A==0) {S=T;}
Assume that registers R1, R2, and R3 hold the values of
A, S, and T, respectively, show the code for this
statement with the branch and with the conditional move.
Answer
The straightforward code, using a branch:
        BNEZ  R1,L      ;skip the move if A != 0
        ADDU  R2,R3,R0  ;S = T
    L:
The conditional instruction, a conditional move that performs
the move only if R1 = 0:
        CMOVZ R2,R3,R1
62
CONDITIONAL MOVE
• Converts a control dependence into a data
  dependence, eliminating the branch and
  improving pipeline behavior.
63
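A minimal C illustration of this conversion (the instructions named in the comments are the ones from the example above):

    /* branch form of  if (A == 0) S = T;  -> BNEZ plus a move on the fall-through path */
    int assign_with_branch(int a, int s, int t) {
        if (a == 0) s = t;
        return s;
    }

    /* branchless form: the selection is now a data dependence on the comparison,
       which is what a conditional move such as CMOVZ R2,R3,R1 implements */
    int assign_with_cmov(int a, int s, int t) {
        s = (a == 0) ? t : s;
        return s;
    }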
4.5 Hardware support for exposing more parallelism at compile time
Conditional or predicated instructions
Example Here is a code sequence for a 2-issue superscalar that
can issue a combination of one memory reference and one ALU
operation, or a branch by itself, every cycle:
First inst slot          Second inst slot
LW   R1,40(R2)           ADD R3,R4,R5
                         ADD R6,R3,R7
BEQZ R10,L
LW   R8,0(R10)
LW   R9,0(R8)
This sequence wastes a memory operation slot and will incur a data
dep stall if the branch is not taken, since the second LW after the
branch depends on the prior load. Show how the code can be
improved using a predicated form of LW.
64
4.5 Hardware support for exposing more parallelism at compile time
Conditional or predicated instructions
Answer
First inst slot          Second inst slot
LW   R1,40(R2)           ADD R3,R4,R5
LWC  R8,0(R10),R10       ADD R6,R3,R7
BEQZ R10,L
LW   R9,0(R8)
The LWC load occurs unless R10 is 0.
65
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for preserving exception behavior
• 4 methods for supporting speculation without
  introducing erroneous exceptions
– The H/W and OS cooperatively ignore exceptions for
speculative instructions
• Preserves exception behavior for correct programs, but not for
incorrect ones
– Speculative insts that never raise exceptions are used,
and checks are introduced to determine when an
exception should occur
– A set of status bits, called poison bits, is attached to
  the result registers written by speculated insts when those
  insts cause exceptions
– A mechanism is provided to indicate that an inst is
speculative, and the H/W buffers the inst result until it is
certain that the inst is no longer speculative
66
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for preserving exception behavior
Example Consider the following code fragment from an if-
then-else statement of the form
if (A==0) A=B; else A=A+4;
where A is at 0(R3) and B is at 0(R2):
LD R1,0(R3) ;load A
BNEZ R1,L1 ;test A
LD R1,0(R2) ;then clause
J L2 ;skip else
L1: DADDI R1,R1,#4 ;else clause
L2: SD R1,0(R3) ;store A
Answer Here is the new code:
LD R1,0(R3) ;load A
LD R14,0(R2) ;speculative load B
BEQZ R1,L3 ;other branch of the if
DADDI R14,R1,#4 ;else clause
L3: SD R14,0(R3) ;nonspeculative store
67
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for preserving exception behavior
Example Show how the previous example can be coded
using a speculative load (sLD) and a speculation check
inst (SPECCK) to completely preserve exception
behavior. Assume R14 is available.
Answer Here is the code:
LD R1,0(R3) ;load A
sLD R14,0(R2) ;speculative, no termination
BNEZ R1,L1 ;test A
SPECCK 0(R2) ;perform speculation check
J L2 ;skip else
L1: DADDI R1,R1,#4 ;else clause
L2: SD R14,0(R3) ;store A
68
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for preserving exception behavior
Example Consider the code fragment from the original
example and show how it would be compiled with
speculative insts and poison bits. Show where an
exception for the speculative memory reference would
be recognized. Assume R14 is available.
Answer Here is the code:
LD R1,0(R3) ;load A
sLD R14,0(R2) ;speculative load B
BEQZ R1,L3 ;
DADDI R14,R1,#4 ;
L3: SD R14,0(R3) ;exception for speculative LW
69
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for memory reference speculation
• A special instruction to check for address
conflicts
– It is left at the original location of the load
instruction
– Acts like a guardian
– The load is moved up across one or more
stores
70
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for memory reference speculation
• When a speculated load is executed
– The hardware saves the address of the
accessed memory location
– If a subsequent store changes the location
before the check instruction, then the
speculation has failed
• Speculation failure can be handled
– If only the load was speculated, then redo the
load at the point of the check instruction
– If additional insts that depend on the load
  were also speculated, then redo them
71
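A minimal, hypothetical C model of this check mechanism (the names spec_load, guarded_store, and spec_check are illustrative, not from any real ISA):

    #include <stdint.h>
    #include <stdbool.h>

    static uintptr_t spec_addr;       /* address saved when the speculated load executes */
    static bool      spec_clobbered;  /* set if a later store writes that address */

    /* the load, moved up across one or more stores */
    static int32_t spec_load(int32_t *p) {
        spec_addr      = (uintptr_t)p;
        spec_clobbered = false;
        return *p;
    }

    /* every store between the moved load and the check is guarded like this */
    static void guarded_store(int32_t *p, int32_t v) {
        *p = v;
        if ((uintptr_t)p == spec_addr)
            spec_clobbered = true;    /* the speculation has failed */
    }

    /* the check left at the load's original position:
       if only the load was speculated, recovery is simply redoing the load */
    static int32_t spec_check(int32_t *p, int32_t speculative_value) {
        return spec_clobbered ? *p : speculative_value;
    }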
4.6 Crosscutting issues: hardware versus software
trade-offs and limitations
• To speculate extensively, we must be able to
  disambiguate memory references.
  – This capability is difficult to achieve at compile time for
    integer programs that contain pointers.
– In a hardware-based scheme, dynamic run time
disambiguation of memory addresses is done using
the techniques we saw earlier for Tomasulo’s
algorithm.
– This disambiguation allows us to move loads past
stores at run time.
– Support for speculative memory references can help
overcome the conservatism of the compiler, but
unless such approaches are used carefully, the
overhead of the recovery mechanisms may swamp
the advantages.
72
4.6 Crosscutting issues: hardware versus software
trade-offs and limitations
• Hardware-based speculation works better
– when control flow is unpredictable,
– and when hardware-based branch prediction is
superior to software-based branch prediction done at
compile time.
– These properties hold for many integer programs.
– For example, a good static predictor has a
  misprediction rate of about 16% for four major integer
  SPEC92 programs, whereas a hardware predictor has a
  misprediction rate of under 10%.
– Because speculated instructions may slow down the
computation when the prediction is incorrect, this
difference is significant.
– One result of this difference is that even statically
scheduled processors normally include dynamic
branch predictors.
73
4.6 Crosscutting issues: hardware versus software
trade-offs and limitations
• Hardware-based speculation maintains a
completely precise exception model even for
speculated instructions. Recent software-based
approaches have added special support to allow
this as well.
• Hardware-based speculation does not require
  compensation or bookkeeping code, which is
  needed by ambitious software speculation
  mechanisms.
• Compiler-based approaches may benefit from
the ability to see further in the code sequence,
resulting in better code scheduling than a purely
hardware-driven approach.
74
4.6 Crosscutting issues: hardware versus software
trade-offs and limitations
• Hardware-based speculation with dynamic
scheduling does not require different code
sequences to achieve good performance for
different implementations of an architecture.
– Although this advantage is the hardest to quantify, it
may be the most important in the long run.
– Interestingly, this was one of the motivations for the
IBM 360/91.
– On the other hand, more recent explicitly parallel
architectures, such as IA-64, have added flexibility
that reduces the hardware dependence inherent in a
code sequence.
75
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• The IA-64 register model
– 128 64-bit GPRs
– 128 82-bit FPRs, which provide two extra
exponent bits over the standard 80-bit IEEE
format
– 64 1-bit predicate registers
– 8 64-bit branch registers, which are used for
indirect branches
– A variety of registers used for system control,
memory mapping, performance counters, and
communication with the OS
76
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Instruction format and support for explicit
parallelism
Fig 11 The five execution unit slots in the IA-64 architecture and
what instruction types they may hold are shown
Execution unit slot   Instruction type   Instruction description   Example instructions
I-unit                A                  Integer ALU               add, subtract, and, or, compare
                      I                  Non-ALU integer           integer and multimedia shifts, bit tests, moves
[rows for the M-unit, F-unit, B-unit, and L+X slots are not recovered]

The unrolled loop x[i] = x[i] + s scheduled into IA-64 bundles (one bundle issued per cycle):
Bundle template   Slot 0              Slot 1              Slot 2              Execute cycle
 8: M M I         L.D F0,0(R1)        L.D F6,-8(R1)                           1
 9: M M I         L.D F10,-16(R1)     L.D F14,-24(R1)                         2
14: M M F         L.D F18,-32(R1)     L.D F22,-40(R1)     ADD.D F4,F0,F2      3
14: M M F         L.D F26,-48(R1)                         ADD.D F8,F6,F2      4
15: M M F                                                 ADD.D F12,F10,F2    5
14: M M F         S.D F4,0(R1)                            ADD.D F16,F14,F2    6
14: M M F         S.D F8,-8(R1)                           ADD.D F20,F18,F2    7
15: M M F         S.D F12,-16(R1)                         ADD.D F24,F22,F2    8
14: M M F         S.D F16,-24(R1)                         ADD.D F28,F26,F2    9
 9: M M I         S.D F20,-32(R1)     S.D F24,-40(R1)                         11
 8: M M I         S.D F28,-48(R1)     DADDUI R1,R1,#-56   BNE R1,R2,Loop      12
81
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Instruction set basics
– Five primary instruction classes:
• A, I, M, F, and B
– Each IA-64 instruction is 41 bits in length
• The high-order 4 bits, together with the bundle bits
that specify the execution unit slot, are used as the
major opcode.
• The low-order 6 bits are used for specifying the
predicate register that guards the instruction
82
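As a sketch of how these fields sit inside a 128-bit bundle (assuming the usual IA-64 layout of a 5-bit template followed by three 41-bit slots; the helper names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* take 'len' bits starting at bit 'pos' of the 128-bit value hi:lo (lo = bits 0..63) */
    static uint64_t bits(uint64_t lo, uint64_t hi, int pos, int len) {
        uint64_t v;
        if (pos < 64) {
            v = lo >> pos;
            if (pos + len > 64)
                v |= hi << (64 - pos);
        } else {
            v = hi >> (pos - 64);
        }
        return v & ((1ULL << len) - 1);
    }

    int main(void) {
        uint64_t lo = 0x0123456789abcdefULL, hi = 0xfedcba9876543210ULL; /* an arbitrary bundle */
        unsigned template_field = (unsigned)bits(lo, hi, 0, 5);          /* selects the unit slots */
        printf("template = %u\n", template_field);
        for (int i = 0; i < 3; i++) {
            uint64_t slot   = bits(lo, hi, 5 + 41 * i, 41);              /* one 41-bit instruction */
            unsigned qp     = (unsigned)(slot & 0x3f);                   /* low 6 bits: qualifying predicate */
            unsigned opcode = (unsigned)((slot >> 37) & 0xf);            /* high 4 bits: major opcode */
            printf("slot %d: predicate p%u, major opcode %u\n", i, qp, opcode);
        }
        return 0;
    }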
Fig 14 A summary of some of the instruction formats of the IA-64 ISA
Instruction  Number of  Representative           Extra        GPRs/  Immediate
type         formats    instructions             opcode bits  FPRs   bits        Other/comment
A            8          Add, subtract, and, or   9            3      0
                        Shift left and add       7            3      0           2-bit shift count
                        ALU immediates           9            2      8
                        Add immediate            3            2      14
                        Add immediate            0            2      22
                        Compare                  4            2      0           2 predicate register destinations
F            15         FP arithmetic            2            4
                        FP compare               2            2                  2 6-bit predicate registers
L+X          4          Move immediate long      2            1      64
83
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Predication support
– An instruction is predicated by specifying a
predicate register, whose identity is placed in
the lower 6 bits of each instruction field.
– Nearly every instruction in the IA-64
architecture can be predicated
– Both if conversion and code motion have
lower overhead
84
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Speculation support
– Support for control speculation, which deals
with deferring exception for speculated
instructions
– Support for mem reference speculation, which
supports speculation of load instructions
• Deferred exception handling
– Supported by providing the equivalent of
poison bits
85
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Deferred exception handling
– For the GPRs, these bits are called NaTs (Not
a Thing), and this extra bit makes 65 bits wide
– For the FPRs, this capability is obtained using
a special value, NaTVal (Not a Thing Value)
• This value is encoded using a significand of 0 and
  an exponent outside of the IEEE range
– Only speculative loads generate such values, but all
  instructions that do not affect memory will cause a
  NaT or NaTVal to be propagated to the result register
86
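A minimal sketch (not the IA-64 encoding) of how a poison/NaT bit behaves, based on the description above: a faulting speculative load sets the bit instead of raising the exception, any consumer propagates it, and the exception is raised only when a non-speculative instruction uses the value:

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { long value; bool nat; } reg_t;   /* 64-bit value + NaT bit: "65 bits wide" */

    /* a speculative load defers its exception by posting NaT */
    static reg_t spec_load(const long *addr) {
        reg_t r = { 0, false };
        if (addr == NULL) { r.nat = true; return r; } /* would have faulted: poison instead */
        r.value = *addr;
        return r;
    }

    /* ordinary ALU op: NaT propagates from either source to the result */
    static reg_t add_regs(reg_t a, reg_t b) {
        reg_t r = { a.value + b.value, a.nat || b.nat };
        return r;
    }

    /* a non-speculative instruction consuming a poisoned value raises the deferred exception */
    static void nonspec_store(long *addr, reg_t v) {
        if (v.nat) { fprintf(stderr, "deferred exception\n"); exit(1); }
        *addr = v.value;
    }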
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Memory reference support in the IA-64
– Use a concept called advanced loads
• A load that has been speculatively moved above
stores on which it is potentially dependent
– ld.a – for advanced load
• Executing this inst creates an entry in a special
table, called the ALAT
• The ALAT stores both the reg destination of the
load and the addr of the accessed mem location
87
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Itanium processor
• The Itanium processor
– The first implementation of the IA-64
– Became available in mid-2001 with an 800 MHz clock rate
– Up to 6 issues per clock
• Up to 3 branches and 2 mem references
– 3-level cache
• 1st-level: split caches, FP data are not in it
• 2nd-level: unified
• 3rd-level: unified, off-chip 4 MB
88
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Itanium processor
• 9 Functional units
– 2 I-units, 2 M-units, 3 B-units, 2 F-units
Fig 15 The latency of some typical instructions on Itanium
Instruction Latency
Integer load 1
Floating-point load 9
Mispredicted branch 9
FP arithmetic 4
89
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Itanium processor
• Instruction issue
– It has an instruction issue window that
contains up to 2 bundles
– It can issue up to 6 instructions in a clock
90
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Itanium processor
• 10-stage pipeline
– Front-end (stages IPG, Fetch, and Route)
• Prefetches up to 32 bytes per clock (2 bundles)
• Prefetch buffer can hold up to 8 bundles (24 insts)
– Instruction delivery (stages EXP and REN)
• Distributes up to 6 insts to the 9 FUs
• Implements reg renaming for both rotation and reg stacking
– Operand delivery (WLD and REG)
• Accesses the reg file, performs reg bypassing, accesses and
updates a reg scoreboard, and checks predicate
dependences
– Execution (EXE, DET, and WRB)
• Executes insts through ALUs and load-store units, detects
  exceptions and posts NaTs, retires insts, and performs
  writeback
91
Fig 16 The SPECint benchmark set shows that the Itanium is
considerably slower than either the Alpha 21264 or the Pentium 4
92
Fig 17 The SPECfp benchmark set shows that the Itanium is somewhat
faster than either the Alpha 21264 or the Pentium 4
93
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
• The Trimedia TM32 architecture
– It is a classic VLIW architecture
– Every instruction contains 5 operations
– It is completely statically scheduled
94
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
Example First compile the loop for the following c code into
MIPS instructions, and then show what it might look like
if the Trimedia processor’s operations fields were the
same as MIPS instructions. (In fact, the Trimedia
operation types are very close to MIPS instructions in
capability.) Assume the functional unit capacities and
latencies shown in Figure 18.
void sum (int a[], int b[], int c[], int n)
{ int i;
  for (i=0; i<n; i++)
    c[i] = a[i] + b[i];
}
Unroll the loop so there are up to 4 copies of the body, if
needed.
Answer See Figures 19 and 20.
95
Fig 18 There are 23 functional units of 11 different types in
the Trimedia CPU
98
Fig 21 The performance and the code size for the EEMBC consumer
benchmarks run on the Trimedia TM1300 and the NEC VR5000 and
shown relative to the performance and code size for the low-end NEC
VR4122.
99
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
• The Transmeta Crusoe processor
– A VLIW processor designed for low-power
– Instruction set compatibility with the x86
• Through a software system that translates from
x86 to the VLIW
– In-order execution
– Instruction size:
• 64 bits – 2 operations
• 128 bits – 4 operations
100
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
• The Crusoe 5 types of operation slots
– ALU: typical RISC ALU operations
– Compute: this slot may specify any integer
  ALU operation (there are 2 integer ALUs), an FP
  operation, or a multimedia operation
– Memory: a load or store
– Branch: a branch
– Immediate: a 32-bit immediate used by
another operation in the instruction
101
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
Fig 22 The energy performance of the processor and
memory interface modules using two multimedia
benchmarks is shown for the Mobile Pentium III and the
Transmeta 3200.
102
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
Fig 23 Power distribution inside a laptop during DVD playback shows
that the processor subsystem consumes only 20% of the power!
104