
Exploiting ILP With Software Approach


Chapter 4

Exploiting ILP with Software Approach


• 4.1 Basic compiler techniques for exposing ILP
• 4.2 Static branch prediction
• 4.3 Static multiple issue: the VLIW approach
• 4.4 Advanced compiler support for exposing and
exploiting ILP
• 4.5 Hardware support for exposing more parallelism at
compile time
• 4.6 Crosscutting issues: hardware versus software
• 4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
• 4.8 Another view: ILP in the embedded and mobile
markets
• 4.9 Fallacies and pitfalls
1
4.1 Basic compiler techniques
for exposing ILP
• This chapter starts by examining the use of
compiler technology to improve the performance
of pipelines and simple multiple-issue
processors
• Three parts of this section
– Basic pipeline scheduling and loop unrolling
– Summary of the loop unrolling and scheduling
example
– Using loop unrolling and pipeline scheduling with
static multiple issue

2
4.1 Basic compiler techniques for exposing ILP
Basic pipeline scheduling and loop unrolling
Fig 1 Latencies of FP operations used in this chapter

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0

3
4.1 Basic compiler techniques for exposing ILP
Basic pipeline scheduling and loop unrolling

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

(ref figure 1)
• 1 Loop: LD F0,0(R1) ;F0=vector element x[i]
2 stall
3 ADD.D F4,F0,F2 ;add scalar s in F2
4 stall
5 stall
6 S.D F4, 0(R1) ;store result
7 DADDUI R1,R1,#-8 ;dec pointer 8B (DW)
8 stall
9 BNE R1,R2,Loop ;branch R1!=R2
10 stall

4
4.1 Basic compiler techniques for exposing ILP
Basic pipeline scheduling and loop unrolling
Schedule the previous example to obtain only one stall

• 1 Loop: LD F0,0(R1)
2 DADDUI R1,R1,#-8
3 ADD.D F4,F0,F2
4 stall
5 BNE R1,R2,Loop ;branch R1!=R2
6 S.D F4,8(R1) ;altered & interchanged with DADDUI

5
4.1 Basic compiler techniques for exposing ILP
Basic pipeline scheduling and loop unrolling

Unrolling four times eliminates three branches and three decrements of R1;
without scheduling, this unrolled loop runs in 28 clock cycles
• 1 Loop: L.D F0,0(R1)
2 ADD.D F4,F0,F2
3 S.D F4,0(R1) ;drop DADDUI & BNE
4 L.D F6,-8(R1)
5 ADD.D F8,F6,F2
6 S.D F8,-8(R1) ;drop DADDUI & BNE
7 L.D F10,-16(R1)
8 ADD.D F12,F10,F2
9 S.D F12,-16(R1) ;drop DADDUI & BNE
10 L.D F14,-24(R1)
11 ADD.D F16,F14,F2
12 S.D F16,-24(R1)
13 DADDUI R1,R1,#-32
14 BNE R1,R2,Loop
6
4.1 Basic compiler techniques for exposing ILP
Basic pipeline scheduling and loop unrolling
Loop unrolling
• Unrolled and scheduled loop:
  1 Loop: L.D    F0,0(R1)
  2       L.D    F6,-8(R1)
  3       L.D    F10,-16(R1)
  4       L.D    F14,-24(R1)
  5       ADD.D  F4,F0,F2
  6       ADD.D  F8,F6,F2
  7       ADD.D  F12,F10,F2
  8       ADD.D  F16,F14,F2
  9       S.D    F4,0(R1)
  10      S.D    F8,-8(R1)
  11      DADDUI R1,R1,#-32
  12      S.D    F12,16(R1)
  13      BNE    R1,R2,Loop
  14      S.D    F16,8(R1)   ; 8-32 = -24

  14 clock cycles, or 3.5 per iteration

• What assumptions were made when the code was moved?
  – OK to move the store past the DADDUI even though it changes the register
  – OK to move the loads before the stores: do we get the right data?
  – When is it safe for the compiler to make such changes?

When is it safe to move instructions?
7
4.1 Basic compiler techniques for exposing ILP
Summary of the loop unrolling and
scheduling example
The arrows show the data dependence
• 1 Loop: L.D F0,0(R1)
2 ADD.D F4,F0,F2
3 S.D F4,0(R1)
4 DADDUI R1,R1,#-8 ;drop BNE
5 L.D F6,0(R1)
6 ADD.D F8,F6,F2
7 S.D F8,0(R1)
8 DADDUI R1,R1,#-8 ; drop BNE
9 L.D F10,0(R1)
10 ADD.D F12,F10,F2
11 S.D F12,0(R1)
12 DADDUI R1,R1,#-8 ; drop BNE
13 L.D F14,0(R1)
14 ADD.D F16,F14,F2
15 S.D F16,0(R1)
16 DADDUI R1,R1,#-8
17 BNE R1,R2,Loop

8
4.1 Basic compiler techniques for exposing ILP

Summary of the loop unrolling and scheduling example
true data dep antidep output dep
• 1 Loop: L.D F0,0(R1)
2 ADD.D F4,F0,F2
3 S.D F4,0(R1) ;drop DADDUI & BNE
4 L.D F0,-8(R1)
5 ADD.D F4,F0,F2
6 S.D F4,-8(R1) ;drop DADDUI & BNE
7 L.D F0,-16(R1)
8 ADD.D F4,F0,F2
9 S.D F4,-16(R1) ;drop DADDUI & BNE
10 L.D F0,-24(R1)
11 ADD.D F4,F0,F2
12 S.D F4,-24(R1)
13 DADDUI R1,R1,#-32
14 BNE R1,R2,Loop
9
4.1 Basic compiler techniques for exposing ILP

Summary of the loop unrolling and scheduling example
true data dep reg renaming → no antidep, no output dep
• 1 Loop: L.D F0,0(R1)
2 ADD.D F4,F0,F2
3 S.D F4,0(R1) ;drop DADDUI & BNE
4 L.D F6,-8(R1)
5 ADD.D F8,F6,F2
6 S.D F8,-8(R1) ;drop DADDUI & BNE
7 L.D F10,-16(R1)
8 ADD.D F12,F10,F2
9 S.D F12,-16(R1) ;drop DADDUI & BNE
10 L.D F14,-24(R1)
11 ADD.D F16,F14,F2
12 S.D F16,-24(R1)
13 DADDUI R1,R1,#-32
14 BNE R1,R2,Loop
10
4.1 Basic compiler techniques for exposing ILP
Using loop unrolling and pipeline
scheduling with static multiple issue
Fig 2 The unrolled and scheduled code as it
would look on a superscalar MIPS
Integer instruction        FP instruction          Clock cycle
Loop: L.D F0, 0(R1) 1
L.D F6, -8(R1) 2
L.D F10, -16(R1) ADD.D F4, F0, F2 3
L.D F14, -24(R1) ADD.D F8, F6, F2 4
L.D F18, -32(R1) ADD.D F12, F10, F2 5
S.D F4, 0(R1) ADD.D F16, F14, F2 6
S.D F8, -8(R1) ADD.D F20, F18, F2 7
S.D F12, -16(R1) 8
DADDUI R1, R1, #-40 9
S.D F16, 16(R1) 10
BNE R1, R2, Loop 11
S.D F20, 8(R1)                                     12
11
4.2 Static branch prediction
• Used when branch behavior is highly predictable at compile time
– Delayed branch can reduce the control
hazard
– Being able to accurately predict a branch at
compile time is also helpful for scheduling
data hazards

12
4.2 Static branch prediction
LD R1,0(R2)
DSUBU R1,R1,R3
BEQZ R1,L1
OR R4,R5,R6
DADDU R10,R4,R3
L1: DADDU R7,R8,R9
• The dep of the DSUBU and BEQZ on the LD inst
means that a stall will be needed after the LD.
• Suppose that this branch was almost always
taken and that R7 was not needed on the fall-
through path
– Could increase the speed by moving DADDU to the
position after the LD.
• Suppose that this branch was rarely taken and
that R4 was not needed on the taken path
– Could increase the speed by moving OR to the
position after the LD.
13
4.2 Static branch prediction
Fig 3 Misprediction rate on SPEC92 for a profile-based predictor varies
widely but is generally better for the FP programs, which have an
average misprediction rate of 9% with a standard deviation of 4%,
than for the INT programs, which have an average misprediction
rate of 15% with a standard deviation of 5%

14
4.2 Static branch prediction
Fig 4 Accuracy of a predicted-taken strategy and a profile-
based predictor for SPEC92 benchmarks as measured
by the number of insts executed between mispredicted
branches and shown on a log scale

15
4.3 Static multiple issue: the VLIW
approach
• Very long instruction word
– 64 to 128 bits, or longer
– Early VLIWs were quite rigid in their formats
and effectively required recompilation for
different versions of the H/W
– To reduce this inflexibility and enhance the
performance of the approach, several
innovations have been incorporated into more
recent architectures of this type

16
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
• VLIW approaches make sense for wider
processors
– Eg. a VLIW processor might have insts that contain 5
operations, including 1 int operation (which could also
be a branch), 2 fp operations, and 2 memory
references
• The VLIW instruction
– It is a set of fields for each functional unit
– Perhaps 16 to 24 bits per unit
– Yielding an inst length of between 112 and 168 bits

17
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
Example suppose we have a VLIW that
could issue 2 mem refs, 2 fp operations,
and 1 int operation or branch in every
clock cycle. Show an unrolled version of
the loop x[i]=x[i]+s for such a processor.
Unroll as many times as necessary to
eliminate any stalls. Ignore the branch
delay slot.

18
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
Fig 5 VLIW insts that occupy the inner loop and replace the unrolled sequence

Memory reference 1   Memory reference 2   FP operation 1      FP operation 2      Integer operation/branch
L.D F0,0(R1)         L.D F6,-8(R1)
L.D F10,-16(R1)      L.D F14,-24(R1)
L.D F18,-32(R1)      L.D F22,-40(R1)      ADD.D F4,F0,F2      ADD.D F8,F6,F2
L.D F26,-48(R1)                           ADD.D F12,F10,F2    ADD.D F16,F14,F2
                                          ADD.D F20,F18,F2    ADD.D F24,F22,F2
S.D F4,0(R1)         S.D F8,-8(R1)        ADD.D F28,F26,F2
S.D F12,-16(R1)      S.D F16,-24(R1)                                              DADDUI R1,R1,#-56
S.D F20,24(R1)       S.D F24,16(R1)
S.D F28,8(R1)                                                                     BNE R1,R2,Loop
19
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
• The technical problems of the original
VLIW
– They are the increase in code size and the
limitations of lockstep operation
– Two elements combine to increase code size
• 1. generating enough operations in a straight-line
code fragment requires ambitiously unrolling loops,
thereby increasing code size
• 2. whenever insts are not full, the unused
functional units translate to wasted bits in the inst
encoding
20
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
• To combat this code size increase
– Clever encodings are sometimes used
– Eg. there may be only one large immediate
field for use by any functional unit
– Another technique is to compress the insts in main mem
and expand them when they are read into the cache or are decoded
• Early VLIWs operated in lockstep
– There was no hazard detection H/W at all
21
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
• Binary code compatibility problem
– In a strict VLIW approach, the code sequence makes
use of both the inst set definition and the detailed
pipeline structure
– Different numbers of functional units and unit
latencies require different versions of the code
– One possible solution is object-code translation or
emulation
– Another approach is to temper the strictness of the
approach so that binary compatibility is still feasible

22
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
• The potential advantages of a multiple-
issue processor versus a vector processor
are twofold
– 1. a multiple-issue processor has the potential
to extract some amount of parallelism from
less regularly structured code
– 2. It has the ability to use a more
conventional, and typically less expensive,
cache-based mem system
23
4.4 Advanced compiler support
for exposing and exploiting ILP
• Coverage
– Detecting and enhancing loop-level
parallelism
– Software pipeline: symbolic loop unrolling
– Global code scheduling

24
4.4 Advanced compiler support for exposing and exploiting ILP

Detecting and enhancing loop-level parallelism
• Loop-level analysis
– Involves determining what dependences exist
among the operands in a loop across the
iterations of that loop
– This text considers only data dependences
• arises when an operand is written at some point
and read at a later point
• Name dependences also exist and may be
removed by renaming techniques

25
4.4 Advanced compiler support for exposing and exploiting ILP

Detecting and enhancing loop-level parallelism
• Loop-carried dependence
– Data accesses in later iterations are
dependent on data values produced in earlier
iterations
• Loop-level parallel
– No loop-carried dependences
for (i=100;i>0;i=i-1) x[i]=x[i]+s;

26
Detecting and Enhancing Loop-Level
Parallelism
• Ex 1)
for (i=1; i<=100; i++) {
A[i+1] = A[i] + C[i]; // S1
B[i+1] = B[i] + A[i+1]; // S2
}
– Dependence
A[2] = A[1] + C[1]; // S1 (i=1)
B[2] = B[1] + A[2]; // S2 (i=1)
A[3] = A[2] + C[2]; // S1 (i=2)
B[3] = B[2] + A[3]; // S2 (i=2)
A[4] = A[3] + C[3]; // S1 (i=3)
B[4] = B[3] + A[4]; // S2 (i=3)
……
• S1 → S1: loop-carried dependence
• S2 → S2: loop-carried dependence
• S1 → S2: dependence in the same iteration

27
Detecting and Enhancing Loop-Level
Parallelism
• Ex 2
for (i=1; i<=100; i++) {
A[i] = A[i] + B[i]; // S1
B[i+1] = C[i] + D[i]; // S2
}
– Dependence
A[1] = A[1] + B[1]; // S1 (i=1)
B[2] = C[1] + D[1]; // S2 (i=1)
A[2] = A[2] + B[2]; // S1 (i=2)
B[3] = C[2] + D[2]; // S2 (i=2)
A[3] = A[3] + B[3]; // S1 (i=3)
B[4] = C[3] + D[3]; // S2 (i=3)
……
• S2 → S1: loop-carried dependence
– The loop is loop-level parallel because there is no circular dependence
28
Detecting and Enhancing Loop-Level
Parallelism
• Ex 3)
A[1] = A[1] + B[1]
for (i=1; i<=99; i++) {
B[i+1] = C[i] + D[i]; // S2
A[i+1] = A[i+1] + B[i+1]; // S1
}
B[101] = C[100] +D[100]
– Dependence
A[1] = A[1] + B[1]; // S1
B[2] = C[1] + D[1]; // S2 (i=1)
A[2] = A[2] + B[2]; // S1 (i=1)
B[3] = C[2] + D[2]; // S2 (i=2)
A[3] = A[3] + B[3]; // S1 (i=2)
B[4] = C[3] + D[3]; // S2 (i=3)
……
29
Detecting and Enhancing Loop-Level
Parallelism
• Another example (with a dependence cycle):
for (i=1; i<=100; i++) {
A[i] = A[i] + B[i]; // S1
B[i+1] = C[i] + A[i]; // S2
}
– Dependence
A[1] = A[1] + B[1]; // S1 (i=1)
B[2] = C[1] + A[1]; // S2 (i=1)
A[2] = A[2] + B[2]; // S1 (i=2)
B[3] = C[2] + A[2]; // S2 (i=2)
A[3] = A[3] + B[3]; // S1 (i=3)
B[4] = C[3] + A[3]; // S2 (i=3)
……
• S2 → S1: loop-carried dependence
• S1 → S2: dependence within the same iteration
⇒ circular dependence, so this loop is not parallel
30
Detecting and Enhancing Loop-Level
Parallelism
• Recurrence (a kind of loop-carried dependence)
– Variable defined based on the value in an earlier iteration
– Example)
for (i=2; i<=100; i=i+1)
Y[i] = Y[i-1] + Y[i];
– Dependence distance in recurrence
– Example)
for (i=6; i<=100; i=i+1)
Y[i] = Y[i-5] + Y[i];
• Distance: 5
– More parallelism with larger distance

31
Detecting and Enhancing Loop-Level
Parallelism
• Recurrence (continued)
– Example)
for (i=6; i<=100; i=i+1)
Y[i] = Y[i-5] + Y[i];
• Distance: 5
Y[6]  = Y[1]  + Y[6];   // i=6
Y[7]  = Y[2]  + Y[7];   // i=7
Y[8]  = Y[3]  + Y[8];   // i=8    iterations 6-10: parallel execution possible
Y[9]  = Y[4]  + Y[9];   // i=9
Y[10] = Y[5]  + Y[10];  // i=10
Y[11] = Y[6]  + Y[11];  // i=11
Y[12] = Y[7]  + Y[12];  // i=12
Y[13] = Y[8]  + Y[13];  // i=13   iterations 11-15: parallel execution possible
Y[14] = Y[9]  + Y[14];  // i=14
Y[15] = Y[10] + Y[15];  // i=15
Y[16] = Y[11] + Y[16];  // i=16

32
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
The larger the distance, the more potential
parallelism can be obtained by unrolling the loop
• Finding the dependences is important in 3
tasks
– Good scheduling of code
– Determining which loops might contain
parallelism
– Eliminating name dependences

33
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
a×j + b and c×k + d, index runs from m to n
• A dependence exists if
– There are two iteration indices, j and k, both
within the limits of the for loop
That is, m ≤ j ≤ n and m ≤ k ≤ n
– The loop stores into an array element indexed
by a×j + b and later fetches from that same
array element when it is indexed by c×k + d
that is, a×j + b = c×k + d
34
1. Dependence analysis difficult for arrays & pointers
2. GCD (greatest common divisor) test:
-simple & sufficient test to guarantee that no
dependence exists.
(i) If loop-carried dependence exists, then GCD(c,a) must
divide (d-b).
(ii)If GCD(c,a) does not divide (d-b), no loop-carried
dependence.

(iii) GCD test example:
    for (i=1; i<=100; i=i+1)
        X[2*i+3] = X[2*i] * 5.0;
    a=2, b=3, c=2, d=0
    GCD(a,c)=2, d-b=-3
    ⇒ no dependence (2 does not divide -3)

35
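A minimal C sketch of the GCD test above; the function and variable names are illustrative, and the program simply reports whether a dependence is possible for the X[2*i+3] = X[2*i] * 5.0 example.

    #include <stdio.h>

    /* The loop stores to X[a*i + b] and later loads X[c*i + d]; if
       gcd(a, c) does not divide (d - b), no loop-carried dependence can
       exist.  The test is only sufficient for independence: passing it
       does not prove that a dependence is real. */
    static int gcd(int x, int y) {
        if (x < 0) x = -x;
        if (y < 0) y = -y;
        while (y != 0) { int t = x % y; x = y; y = t; }
        return x;
    }

    static int dependence_possible(int a, int b, int c, int d) {
        int g = gcd(a, c);
        if (g == 0)                 /* a == 0 and c == 0: constant indices */
            return b == d;
        return (d - b) % g == 0;    /* divisible -> a dependence may exist */
    }

    int main(void) {
        /* X[2*i+3] = X[2*i] * 5.0  ->  a = 2, b = 3, c = 2, d = 0 */
        printf("%s\n", dependence_possible(2, 3, 2, 0)
                           ? "dependence possible"
                           : "no loop-carried dependence");
        return 0;
    }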
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
Example The following loop has multiple types of
dependences. Find all the true dependences, output
dependences, and antidependences, and eliminate the
output dependences and antidependences by renaming.
for (i:=1;i<=100;i=i+1) {
Y[i]=X[i] / c; /* S1 */
X[i]=X[i] + c; /* S2 */
Z[i]=Y[i] + c; /* S3 */
Y[i]=c – Y[i]; /* S4 */ }
Answer
1. Y[i] → true dep from S1 to S3 and from S1 to S4.
not loop-carried
2. X[i] → antidep from S1 to S2
3. Y[i] → antidep from S3 to S4
4. Y[i] → output dep from S1 to S4

36
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
Renaming eliminates the false dependences:
for (i:=1;i<=100;i=i+1) {
/* Y renamed to T to remove output dep */
T[i]=X[i] / c; /* S1 */
/* X renamed to X1 to remove antidep */
X1[i]=X[i] + c; /* S2 */
/* Y renamed to T to remove antidep */
Z[i]=T[i] + c; /* S3 */
Y[i]=c – Y[i]; /* S4 */ }
After the loop, the variable X has been renamed X1. In
code that follows the loop, the compiler can simply
replace the name X by X1.
37
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
• Situations in which array-oriented dep analysis
cannot tell us what we might want to know:
– When objects are referenced via pointers
– When array indexing is indirect through another array
– When a dep may exist for some value of inputs, but
does not exist in actuality when the code is run since
the inputs never take on those values
– When an optimization depends on knowing more than
just the possibility of a dep, but needs to know on
which write of a variable does a read of that variable
depend
38
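A short C sketch of two of the situations listed above (indirect indexing through another array, and references via pointers); the array and parameter names are illustrative.

    /* Indices come from another array: the compiler cannot tell whether
       two iterations touch the same element of x. */
    void indirect(double *x, const int *idx, int n, double s) {
        for (int i = 0; i < n; i++)
            x[idx[i]] = x[idx[i]] + s;
    }

    /* p and q may alias; without extra information (e.g., C99 restrict)
       the compiler must assume a possible dependence between the store
       and later loads. */
    void via_pointers(double *p, double *q, int n) {
        for (int i = 0; i < n; i++)
            p[i] = q[i] + 1.0;
    }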
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
• The basic approach used in points-to
analysis relies on information from
– 1. Type information, which restricts what a
pointer can point to
– 2. Information derived when an object is
allocated or when the address of an object is
taken, which can be used to restrict what a
pointer can point to
– 3. Information derived from pointer
assignment.
39
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
• Several cases where analyzing pointers has
been successfully applied and extremely useful:
– When pointers are used to pass the addr of an object
as a parameter, it is possible to use points-to analysis
to determine the possible set of objects referenced by
a pointer
– When a pointer can point to one of several types, it is
sometimes possible to determine the type of the data
object that a pointer designates at different parts of
the program
– It is often possible to separate out pointers that may
only point to a local object versus a global one
40
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
• Eliminating dependent computations
  – Copy propagation: eliminates operations that copy values

        DADDUI R1,R2,#4
        DADDUI R1,R1,#4
    becomes
        DADDUI R1,R2,#8

  – Tree height reduction: makes the expression tree wider but shorter, so the sum executes in 2 cycles instead of 3

        ADD R1,R2,R3             ADD R1,R2,R3
        ADD R4,R1,R6      →      ADD R4,R6,R7
        ADD R8,R4,R7             ADD R8,R1,R4
41
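A minimal C sketch of the tree height reduction above, assuming integer operands so the reassociation is exact; the function and parameter names mirror the registers only for illustration.

    int sum_serial(int r2, int r3, int r6, int r7) {
        int r1 = r2 + r3;   /* ADD R1,R2,R3                                 */
        int r4 = r1 + r6;   /* ADD R4,R1,R6  -- must wait for R1            */
        return r4 + r7;     /* ADD R8,R4,R7  -- must wait for R4: 3 serial adds */
    }

    int sum_balanced(int r2, int r3, int r6, int r7) {
        int r1 = r2 + r3;   /* ADD R1,R2,R3  \ independent: can execute     */
        int r4 = r6 + r7;   /* ADD R4,R6,R7  /  in the same cycle           */
        return r1 + r4;     /* ADD R8,R1,R4  -- critical path of only 2 adds */
    }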
4.4 Advanced compiler support for exposing and exploiting ILP
Detecting and enhancing loop-level
parallelism
• Eliminating dependent computations
  – Recurrences: an operation whose value depends on a value computed in an earlier iteration
        sum = sum + x;
    Unrolling 5 times gives
        sum = sum + x1 + x2 + x3 + x4 + x5;
    which still contains 5 dependent operations. Rewriting it as
        sum = ((sum + x1) + (x2 + x3)) + (x4 + x5);
    leaves only 3 dependent operations.

42
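A minimal C sketch of the recurrence rewrite above, assuming for illustration that the sum runs over an array (any leftover elements when n is not a multiple of 5 are ignored for brevity); note that for floating point this reassociation can change rounding slightly.

    double reduce_serial(const double *x, int n) {
        double sum = 0.0;
        for (int i = 0; i + 5 <= n; i += 5)            /* unrolled by 5 */
            sum = sum + x[i] + x[i+1] + x[i+2] + x[i+3] + x[i+4];  /* 5 dependent adds */
        return sum;
    }

    double reduce_balanced(const double *x, int n) {
        double sum = 0.0;
        for (int i = 0; i + 5 <= n; i += 5)
            /* ((sum + x1) + (x2 + x3)) + (x4 + x5): 3 adds on the critical path */
            sum = ((sum + x[i]) + (x[i+1] + x[i+2])) + (x[i+3] + x[i+4]);
        return sum;
    }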
4.4 Advanced compiler support for exposing and exploiting ILP
Software pipeline: symbolic loop
unrolling
• Software pipeline
– Reorganize loops such that each iteration in the
software-pipelined code is made from instructions
chosen from different iterations of the original loop
Fig 6 a software-pipelined loop chooses insts from different loop
iterations, separating the dep insts within one iteration of the original
loop

43
4.4 Advanced compiler support for exposing and exploiting ILP
Software pipeline: symbolic loop
unrolling
Example Show a software-pipelined version of
this loop, which increments all the elements of
an array whose starting address is in R1 by the
contents of F2:
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
DADDUI R1,R1,#-8
BNE R1,R2,Loop
You may omit the start-up and clean-up code.

44
Software Pipelining: Symbolic Loop Unrolling

• Example)
  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        DADDUI R1,R1,#-8
        BNE    R1,R2,Loop

– Unrolled iterations:
  Iteration 1: L.D    F0,0(R1)
               ADD.D  F4,F0,F2
               S.D    F4,0(R1)
               DADDUI R1,R1,#-8
  Iteration 2: L.D    F0,0(R1)
               ADD.D  F4,F0,F2
               S.D    F4,0(R1)
               DADDUI R1,R1,#-8
  Iteration 3: L.D    F0,0(R1)
               ADD.D  F4,F0,F2
               S.D    F4,0(R1)
               DADDUI R1,R1,#-8
  ......

– Software-pipelined version:
        L.D    F0,0(R1)     // start-up code
        ADD.D  F4,F0,F2
        DADDUI R1,R1,#-8
        L.D    F0,0(R1)
        DADDUI R1,R1,#-8
  Loop: S.D    F4,16(R1)    ; stores into M[i]
        ADD.D  F4,F0,F2     ; adds to M[i-1]
        L.D    F0,0(R1)     ; loads M[i-2]
        DADDUI R1,R1,#-8
        BNE    R1,R2,Loop
        ......              // finish-up code
45
4.4 Advanced compiler support for exposing and exploiting ILP
Software pipeline: symbolic loop
unrolling
Fig 7 The execution pattern for (a) a software-pipelined
loop and (b) an unrolled loop

46
Software Pipelining: Symbolic Loop Unrolling

• Comparison with loop unrolling
  – Software pipelining consumes less code space
  – It reduces the time during which the loop is not running at peak speed (see Fig 7)
  – Loop unrolling reduces the loop control overhead, but that overhead still remains
  – Software pipelining is more complex to apply
• A combination of software pipelining and loop unrolling is desirable when the loop body is small and the latencies are large
47
OVERHEADS
• Many loops require significant transformation before they can be software pipelined.
• The issue of register management creates additional complexity.

48
4.3 Static multiple issue: the VLIW approach
The basic VLIW approach
• Local scheduling techniques
  – Loop unrolling generates straight-line code (it works well when the loop body is straight-line code, since a repeatable schedule is easier to find)
  – Operate on a single basic block
• Global scheduling techniques
  – Schedule code across branches (needed when there is internal control flow, e.g., an inner loop containing conditional branches)
  – More complex in structure
  – Must deal with significantly more complicated trade-offs in optimization

49
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
• Global code scheduling
– Effective scheduling of a loop body with internal control
flow will require moving insts across branches
– Aims to compact a code fragment with internal control
structure into the shortest possible sequence that
preserves the data and control dependence
– It can reduce the effect of control dependences arising
from conditional nonloop branches by moving code
– Does not guarantee faster code.
– Effectively using global code motion requires estimates
of the relative frequency of different paths
50
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
Fig 8 A code fragment and the common path shaded with
gray
LD R4,0(R1) ;load A
LD R5,0(R2) ;load B
DADDU R4,R4,R5 ;add to A
SD R4,0(R1) ;store A

BNEZ R4,elsepart ;test A
… ;then part
SD …,0(R2) ;store to B

J join ;jump over else
elsepart: … ;else part
X ;code for X

join: … ;after if
SD …,0(R3) ;store to C
51
CONSTRAINTS
(on moving the assignments to B and C earlier in the execution sequence)
• The code motion must not affect the data flow
• It must not introduce an exception (e.g., from a memory reference)

52
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
What are the relative execution frequencies of the then case and the else case in the branch? If the then case is much more frequent, the code motion may be beneficial.

What is the cost of executing the computation and assignment to B above the branch? It may be that there are a number of empty inst issue slots in the code above the branch.

How will the movement of B change the execution time for the then case? If B is at the start of the critical path for the then case, moving it may be highly beneficial.

Is B the best code fragment that can be moved above the branch? How does it compare with moving C or other statements within the then case?

What is the cost of the compensation code that may be necessary for the else case? How effectively can this code be scheduled, and what is its impact on execution time?
53
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
• Trace scheduling: focusing on the critical
path
– Useful for processors with a large number of
issues per clock
– A way to organize the global code motion
process, so as to simplify the code scheduling
by incurring the costs of possible code motion
on the less frequent paths

54
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
• Two steps to trace scheduling
– Trace selection
  tries to find a likely sequence of basic blocks whose operations will be put together into a smaller number of insts (the trace)
– Trace compaction (once the trace is selected)
  tries to squeeze the trace into a small number of wide insts
55
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
Fig 9 This trace is obtained by assuming that the program fragment in Fig 8 is
the inner loop and unwinding it four times, treating the shaded portion in Fig
8 as the likely path

56
TRACE SCHEDULING
DISADVANTAGE:
Entries into and exits out of the middle of the trace require compensation code and complicate compilation

57
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
• Superblocks
– are formed by a process similar to that used
for traces
– but are a form of extended basic block, which is restricted to a single entry point but allows multiple exits (so compaction becomes easier, since only code motion across an exit needs to be considered)

58
4.4 Advanced compiler support for exposing and exploiting ILP
Global code scheduling
Fig 10 This superblock results from unrolling the code in Fig 8 four
times and creating a superblock

59
Hardware support for exposing
more parallelism
• Techniques such as loop unrolling, s/w pipelining, and global/trace scheduling improve ILP only when branch behavior is predictable at compile time.

60
4.5 Hardware support for exposing
more parallelism at compile time
• Conditional or predicated instructions
– An inst refers to a condition, which is
evaluated as part of the inst execution
– If the condition is true, the inst is executed
normally
– If the condition is false, the execution continues as if the inst were a no-op

61
4.5 Hardware support for exposing more parallelism at compile time
Conditional or predicated
instructions
Example Consider the following code:
    if (A==0) {S=T;}
Assuming that registers R1, R2, and R3 hold the values of A, S, and T, respectively, show the code for this statement with the branch and with the conditional move.
Answer
The straightforward code using a branch:
        BNEZ  R1,L
        ADDU  R2,R3,R0
  L:
The conditional instruction:
        CMOVZ R2,R3,R1   ;conditional move that performs the move only if R1=0
62
CONDITIONAL MOVE
• Converts a control dependence into a data dependence, eliminating the branch and improving pipeline behavior.

63
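A small C sketch of the if-conversion this slide describes, reusing the A/S/T example above; the function names are illustrative, and the ternary form is what a conditional move such as CMOVZ can implement without a branch.

    void with_branch(int a, int *s, int t) {
        if (a == 0)              /* BNEZ R1,L ; ADDU R2,R3,R0 ; L:      */
            *s = t;
    }

    void with_cmov(int a, int *s, int t) {
        *s = (a == 0) ? t : *s;  /* what CMOVZ R2,R3,R1 expresses:
                                    the move happens only if R1 == 0    */
    }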
4.5 Hardware support for exposing more parallelism at compile time
Conditional or predicated instructions
Example Here is a code sequence for a 2-issue superscalar that
can issue a combination of one memory reference and one ALU
operation, or a branch by itself, every cycle:
First inst slot Second inst slot
LW R1,40(R2) ADD R3,R4,R5
ADD R6,R3,R7
BEQZ R10,L
LW R8,0(R10)
LW R9,0(R8)

This sequence wastes a memory operation slot and will incur a data
dep stall if the branch is not taken, since the 2nd LW after the
branch depends on the prior load. Show how the code can be
improved using a predicated form of LW.
64
4.5 Hardware support for exposing more parallelism at compile time
Conditional or predicated instructions
Answer
First inst slot              Second inst slot
LW   R1,40(R2)               ADD R3,R4,R5
LWC  R8,0(R10),R10           ADD R6,R3,R7    ;the load occurs unless the third operand (R10) is 0
BEQZ R10,L
LW   R9,0(R8)

65
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for preserving exception behavior
• 4 methods for supporting speculation without
introducing erroneous exception
– The H/W and OS cooperatively ignore exceptions for
speculative instructions
• Preserves exception behavior for correct programs, but not for
incorrect ones
– Speculative insts that never raise exceptions are used,
and checks are introduced to determine when an
exception should occur
– A set of status bits, called poison bits, are attached to the result registers written by speculative insts when those insts cause exceptions
– A mechanism is provided to indicate that an inst is
speculative, and the H/W buffers the inst result until it is
certain that the inst is no longer speculative
66
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for preserving exception behavior
Example Consider the following code fragment from an if-
then-else statement of the form
if (A==0) A=B; else A=A+4;
where A is at 0(R3) and B is at 0(R2):
LD R1,0(R3) ;load A
BNEZ R1,L1 ;test A
LD R1,0(R2) ;then clause
J L2 ;skip else
L1: DADDI R1,R1,#4 ;else clause
L2: SD R1,0(R3) ;store A
Answer Here is the new code:
LD R1,0(R3) ;load A
LD R14,0(R2) ;speculative load B
BEQZ R1,L3 ;other branch of the if
DADDI R14,R1,#4 ;else clause
L3: SD R14,0(R3) ;nonspeculative store
67
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for preserving exception behavior
Example Show how the previous example can be coded
using a speculative load (sLD) and a speculation check
inst (SPECCK) to completely preserve exception
behavior. Assume R14 is available.
Answer Here is the code:
LD R1,0(R3) ;load A
sLD R14,0(R2) ;speculative, no termination
BNEZ R1,L1 ;test A
SPECCK 0(R2) ;perform speculation check
J L2 ;skip else
L1: DADDI R1,R1,#4 ;else clause
L2: SD R14,0(R3) ;store A

68
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for preserving exception behavior
Example Consider the code fragment from the original
example and show how it would be compiled with
speculative insts and poison bits. Show where an
exception for the speculative memory reference would
be recognized. Assume R14 is available.
Answer Here is the code:
LD R1,0(R3) ;load A
sLD R14,0(R2) ;speculative load B
BEQZ R1,L3 ;
DADDI R14,R1,#4 ;
L3: SD R14,0(R3) ;exception for speculative LW

69
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for memory reference speculation
• A special instruction to check for address
conflicts
– It is left at the original location of the load
instruction
– Acts like a guardian
– The load is moved up across one or more
stores

70
4.5 Hardware support for exposing more parallelism at compile time
Compiler speculation with hardware support
Hardware support for memory reference speculation
• When a speculated load is executed
– The hardware saves the address of the
accessed memory location
– If a subsequent store changes the location
before the check instruction, then the
speculation has failed
• Speculation failure can be handled
– If only the load was speculated, then redo the
load at the point of the check instruction
– If additional insts that depend on the load
were also speculated, then redo them
71
4.6 Crosscutting issues: hardware versus software
trade-offs and limitations
• To speculate extensively, we must be able to disambiguate memory references.
– This is difficult to do at compile time for integer programs that contain pointers.
– In a hardware-based scheme, dynamic run time
disambiguation of memory addresses is done using
the techniques we saw earlier for Tomasulo’s
algorithm.
– This disambiguation allows us to move loads past
stores at run time.
– Support for speculative memory references can help
overcome the conservatism of the compiler, but
unless such approaches are used carefully, the
overhead of the recovery mechanisms may swamp
the advantages.
72
4.6 Crosscutting issues: hardware versus software
trade-offs and limitations
• Hardware-based speculation works better
– when control flow is unpredictable,
– and when hardware-based branch prediction is
superior to software-based branch prediction done at
compile time.
– These properties hold for many integer programs.
– For example, a good static predictor has a misprediction rate of about 16% for four major integer SPEC92 programs, whereas a hardware predictor has a misprediction rate under 10%.
– Because speculated instructions may slow down the
computation when the prediction is incorrect, this
difference is significant.
– One result of this difference is that even statically
scheduled processors normally include dynamic
branch predictors.
73
4.6 Crosscutting issues: hardware versus software
trade-offs and limitations
• Hardware-based speculation maintains a
completely precise exception model even for
speculated instructions. Recent software-based
approaches have added special support to allow
this as well.
• Hardware-based speculation does not require
compensation or bookkeeping code, which is
needed by ambitious software speculation mechanisms.
• Compiler-based approaches may benefit from
the ability to see further in the code sequence,
resulting in better code scheduling than a purely
hardware-driven approach.
74
4.6 Crosscutting issues: hardware versus software
trade-offs and limitations
• Hardware-based speculation with dynamic
scheduling does not require different code
sequences to achieve good performance for
different implementations of an architecture.
– Although this advantage is the hardest to quantify, it
may be the most important in the long run.
– Interestingly, this was one of the motivations for the
IBM 360/91.
– On the other hand, more recent explicitly parallel
architectures, such as IA-64, have added flexibility
that reduces the hardware dependence inherent in a
code sequence.
75
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• The IA-64 register model
– 128 64-bit GPRs
– 128 82-bit FPRs, which provide two extra
exponent bits over the standard 80-bit IEEE
format
– 64 1-bit predicate registers
– 8 64-bit branch registers, which are used for
indirect branches
– A variety of registers used for system control,
memory mapping, performance counters, and
communication with the OS
76
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Instruction format and support for explicit
parallelism
Fig 11 The five execution unit slots in the IA-64 architecture and the instruction types they may hold

Execution unit slot   Instruction type   Instruction description   Example instructions
I-unit                A                  Integer ALU               add, subtract, and, or, compare
                      I                  Non-ALU integer           Integer and multimedia shifts, bit tests, moves
M-unit                A                  Integer ALU               add, subtract, and, or, compare
                      M                  Memory access             Loads and stores for integer/FP registers
F-unit                F                  Floating point            Floating-point instructions
B-unit                B                  Branches                  Conditional branches, calls, loop branches
L+X                   L+X                Extended                  Extended immediates, stops and no-ops
77
Fig 12 The 24 possible template values (8 possible values are
reserved) and the instruction slots and stops for each format
template Slot 0 Slot 1 Slot 2
0 M I I
1 M I I
2 M I I
3 M I I
4 M L X
5 M L X
8 M M I
9 M M I
10 M M I
11 M M I
12 M F I
13 M F I
14 M M F
15 M M F
16 M I B
17 M I B
18 M B B
19 M B B
22 B B B
23 B B B
24 M M B
25 M M B
28 M F B
29 M F B
78
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
Example Unroll the array increment example,
x[i]=x[i]+s, seven times and place the
instructions into bundles, first ignoring pipeline
latencies (to minimize the number of bundles)
and then scheduling the code to minimize stalls.
In scheduling the code assume one bundle
executes per clock and that any stalls cause the
entire bundle to be stalled. Use the pipeline
latencies from Fig 4.1. Use MIPS instruction
mnemonics for simplicity.
79
Answer
Fig 13 The IA-64 instructions, including bundle bits and stops, for the
unrolled version of x[i]=x[i]+s, when unrolled seven times and
scheduled (a) to minimize the number of instruction bundles

Bundle template   Slot 0             Slot 1               Slot 2               Execute cycle (1 bundle/cycle)
9: M M I          L.D F0, 0(R1)      L.D F6, -8(R1)                            1
14: M M F         L.D F10, -16(R1)   L.D F14, -24(R1)     ADD.D F4, F0, F2     3
15: M M F         L.D F18, -32(R1)   L.D F22, -40(R1)     ADD.D F8, F6, F2     4
15: M M F         L.D F26, -48(R1)   S.D F4, 0(R1)        ADD.D F12, F10, F2   6
15: M M F         S.D F8, -8(R1)     S.D F12, -16(R1)     ADD.D F16, F14, F2   9
15: M M F         S.D F16, -24(R1)                        ADD.D F20, F18, F2   12
15: M M F         S.D F20, -32(R1)                        ADD.D F24, F22, F2   15
15: M M F         S.D F24, -40(R1)                        ADD.D F28, F26, F2   18
12: M M F         S.D F28, -48(R1)   DADDUI R1, R1, #-56  BNE R1, R2, Loop     21
80
Fig 13 The IA-64 instructions, including bundle bits and stops, for the
unrolled version of x[i]=x[i]+s, when unrolled seven times and
scheduled (b) to minimize the number of cycles (assuming that a
hazard stalls an entire bundle)

Bundle template   Slot 0             Slot 1               Slot 2               Execute cycle (1 bundle/cycle)
8: M M I          L.D F0, 0(R1)      L.D F6, -8(R1)                            1
9: M M I          L.D F10, -16(R1)   L.D F14, -24(R1)                          2
14: M M F         L.D F18, -32(R1)   L.D F22, -40(R1)     ADD.D F4, F0, F2     3
14: M M F         L.D F26, -48(R1)                        ADD.D F8, F6, F2     4
15: M M F                                                 ADD.D F12, F10, F2   5
14: M M F         S.D F4, 0(R1)                           ADD.D F16, F14, F2   6
14: M M F         S.D F8, -8(R1)                          ADD.D F20, F18, F2   7
15: M M F         S.D F12, -16(R1)                        ADD.D F24, F22, F2   8
14: M M F         S.D F16, -24(R1)                        ADD.D F28, F26, F2   9
9: M M I          S.D F20, -32(R1)   S.D F24, -40(R1)                          11
8: M M I          S.D F28, -48(R1)   DADDUI R1, R1, #-56  BNE R1, R2, Loop     12

81
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Instruction set basics
– Five primary instruction classes:
• A, I, M, F, and B
– Each IA-64 instruction is 41 bits in length
• The high-order 4 bits, together with the bundle bits
that specify the execution unit slot, are used as the
major opcode.
• The low-order 6 bits are used for specifying the
predicate register that guards the instruction

82
Fig 14 A summary of some of the instruction formats of the IA-64 ISA
(each row: representative instructions, then extra opcode bits / GPRs or FPRs / immediate bits / other comment)

A (8 formats)
  Add, subtract, and, or: 9 / 3 / 0
  Shift left and add: 7 / 3 / 0 / 2-bit shift count
  ALU immediates: 9 / 2 / 8
  Add immediate: 3 / 2 / 14
  Add immediate: 0 / 2 / 22
  Compare: 4 / 2 / 0 / 2 predicate register destinations
  Compare immediate: 3 / 1 / 8 / 2 predicate register destinations
I (29 formats)
  Shift R/L variable: 9 / 3 / 0 / many multimedia instructions use this format
  Test bit: 6 / 3 / 6-bit field specifier / 2 predicate register destinations
  Move to BR: 6 / 1 / 9-bit branch predict / branch register specifier
M (46 formats)
  Integer/FP load and store, line prefetch: 10 / 2 / 0 / speculative/nonspeculative
  Integer/FP load and store, and line prefetch and postincrement by immediate: 9 / 2 / 8 / speculative/nonspeculative
  Integer/FP load prefetch and register postincrement: 10 / 3 / - / speculative/nonspeculative
  Integer/FP speculation check: 3 / 1 / 21 in two fields
B (9 formats)
  PC-relative branch, counted branch: 7 / 0 / 21
  PC-relative call: 4 / 0 / 21 / 1 branch register
F (15 formats)
  FP arithmetic: 2 / 4
  FP compare: 2 / 2 / - / 2 6-bit predicate regs
L+X (4 formats)
  Move immediate long: 2 / 1 / 64
83
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture

• Predication support
– An instruction is predicated by specifying a
predicate register, whose identity is placed in
the lower 6 bits of each instruction field.
– Nearly every instruction in the IA-64
architecture can be predicated
– Both if conversion and code motion have
lower overhead

84
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Speculation support
– Support for control speculation, which deals
with deferring exception for speculated
instructions
– Support for mem reference speculation, which
supports speculation of load instructions
• Deferred exception handling
– Supported by providing the equivalent of
poison bits
85
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Deferred exception handling
– For the GPRs, these bits are called NaTs (Not a Thing); the extra bit makes the registers 65 bits wide
– For the FPRs, this capability is obtained using a special value, NaTVal (Not a Thing Value)
  • This value is encoded using a significand of 0 and an exponent outside of the IEEE range
– Only speculative loads generate such values, but all instructions that do not affect mem will cause a NaT or NaTVal to be propagated to the result register
86
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Intel IA-64 instruction set architecture
• Memory reference support in the IA-64
– Use a concept called advanced loads
• A load that has been speculatively moved above
stores on which it is potentially dependent
– ld.a – for advanced load
• Executing this inst creates an entry in a special
table, called the ALAT
• The ALAT stores both the reg destination of the
load and the addr of the accessed mem location

87
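A conceptual sketch in C of what an advanced load permits, with the IA-64 sequence only outlined in a comment; the pointer names are illustrative and the exact IA-64 assembly syntax is not reproduced.

    /* Without an advanced load, the load of *p cannot legally be hoisted
       above the store to *q, because p and q may alias, so the load
       latency is fully exposed. */
    void no_advanced_load(int *p, int *q, int x, int *y) {
        *q = x;            /* store that may alias *p */
        *y = *p + 4;       /* dependent load stays below the store */
    }

    /* With ld.a and its check, the compiler can conceptually schedule:
         ld.a   t = [p]      // advanced load; address recorded in the ALAT
         st     [q] = x      // hardware checks stores against ALAT entries
         ...                 // other useful work hides the load latency
         check  t, fixup     // if a store conflicted, branch to fixup code
                             // that re-executes the load (and dependents)
       so the load issues early and the check is cheap when no conflict
       occurs. */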
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Itanium processor
• The Itanium processor
– The first implementation of the IA-64
– Available in mid-2001 at 800 MHz
– Up to 6 issues per clock
• Up to 3 branches and 2 mem references
– 3-level cache
• 1st-level: split caches, FP data are not in it
• 2nd-level: unified
• 3rd-level: unified, off-chip 4 MB
88
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Itanium processor
• 9 Functional units
– 2 I-units, 2 M-units, 3 B-units, 2 F-units
Fig 15 The latency of some typical instructions on Itanium
Instruction Latency

Integer load 1

Floating-point load 9

Correctly predicted taken branch 0-3

Mispredicted branch 9

Integer ALU operations 0

FP arithmetic 4

89
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Itanium processor
• Instruction issue
– It has an instruction issue window that
contains up to 2 bundles
– It can issue up to 6 instructions in a clock

90
4.7 Putting it all together: the Intel IA-64 architecture and
Itanium processor
The Itanium processor
• 10-stage pipeline
– Front-end (stages IPG, Fetch, and Route)
• Prefetches up to 32 bytes per clock (2 bundles)
• Prefetch buffer can hold up to 8 bundles (24 insts)
– Instruction delivery (stages EXP and REN)
• Distributes up to 6 insts to the 9 FUs
• Implements reg renaming for both rotation and reg stacking
– Operand delivery (WLD and REG)
• Accesses the reg file, performs reg bypassing, accesses and
updates a reg scoreboard, and checks predicate
dependences
– Execution (EXE, DET, and WRB)
• Executes insts through ALUs and load-store units, detects exceptions and posts NaTs, retires insts, and performs writeback
91
Fig 16 The SPECint benchmark set shows that the Itanium is
considerably slower than either the Alpha 21264 or the Pentium 4

92
Fig 17 The SPECfp benchmark set shows that the Itanium is somewhat
faster than either the Alpha 21264 or the Pentium 4

93
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
• The Trimedia TM32 architecture
– It is a classic VLIW architecture
– Every instruction contains 5 operations
– It is completely statically scheduled

94
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
Example First compile the loop for the following c code into
MIPS instructions, and then show what it might look like
if the Trimedia processor’s operations fields were the
same as MIPS instructions. (In fact, the Trimedia
operation types are very close to MIPS instructions in
capability.) Assume the functional unit capacities and
latencies shown in Figure 18.
void sum (int a[], int b[], int c[], int n)
{ int i;
for (i=0; i<n; i++)
c[i] = a[i] + b[i];
}
Unroll the loop so there are up to 4 copies of the body, if
needed.
Answer See Figures 19 and 20.
95
Fig 18 There are 23 functional units of 11 different types in
the Trimedia CPU
Functional unit   Unit latency   Operation slots   Typical operations performed by functional unit
ALU               0              1,2,3,4,5         integer add/subtract/compare, logicals
DMem              2              4,5               loads and stores
DMemSpec          2              5                 cache invalidate, prefetch, allocate
Shifter           0              1,3               shifts and rotates
DSPALU            1              1,3               simple DSP arithmetic operations
DSPMul            2              2,3               DSP operations with multiplication
Branch            3              2,3,4             branches and jumps
FALU              2              1,4               FP add, subtract
IFMul             2              2,3               integer and FP multiply
FComp             0              3                 FP compare
FTough            16             2                 FP divide, square root
96


Fig 19 The MIPS code for the integer vector sum shown in part (a)
before unrolling and in part (b) after unrolling 4 times
Loop: LD R11,0(R4) # R11 = a[I]
LD R12,0(R5) # R12 = b[I]
DADDU R17,R11,R12 # R17 = a[I] + b[I]
SD R17,0(R6) # c[I] = a[I]+b[I]
DADDUI R4,R4,8 # R4 = next a[] address
DADDUI R5,R5,8 # R5 = next b[] address
DADDUI R6,R6,8 # R6 = next c[] address
BNE R4,R7,Loop # if not last go to Loop
(a) The MIPS code before unrolling
Loop: LD R11,0(R4) # load a[I]
LD R12,0(R5) # load b[I]
DADDU R17,R11,R12 # a[I] + b[I]
SD R17,0(R6) # c[I] = a[I]+b[I]
LD R14,8(R4) # load a[I]
LD R15,8(R5) # load b[I]
DADDU R18,R14,R15 # a[I] + b[I]
SD R18,8(R6) # c[I] = a[I]+b[I]
LD R19,16(R4) # load a[I]
LD R20,16(R5) # load b[I]
DADDU R21,R19,R20 # a[I] + b[I]
SD R21,16(R6) # c[I] = a[I]+b[I]
LD R22,24(R4) # load a[I]
LD R23,24(R5) # load b[I]
DADDU R24,R22,R23 # a[I] + b[I]
SD R24,24(R6) # c[I] = a[I]+b[I]
DADDIU R4,R4,32 # R4 = next a[] address
DADDIU R5,R5,32 # R5 = next b[] address
DADDIU R6,R6,32 # R6 = next c[] address
BNE R4,R7,Loop # if not last go to Loop
(b) The MIPS code after unrolling four times and optimizing the code but not
scheduling it (for simplicity, we have assumed that n is a multiple of four)
97
Fig 20 The Trimedia code for a simple loop summing two vectors to
generate a third makes good use of multiple memory ports but still
contains a large fraction of idle slots

Slot 1 Slot 2 Slot 3 Slot 4 Slot 5

      LD R11,0(R4) LD R12,0(R5)

DADDUI R25,R6,32     LD R14,8(R4) LD R15,8(R5)

SETEQ R25,R25,R7     LD R19,16(R4) LD R20,16(R5)

DADDU R17,R11,R12 DADDIU R4,R4,32   LD R22,24(R4) LD R23,24(R5)

DADDU R18,R14,R15 JMPF R25,R30   SD R17,0(R6)  

DADDU R21,R19,R20 DADDIU R5,R5,32   SD R18,8(R6)  

DADDU R24,R22,R23     SD R21,16(R6)  

DADDIU R6,R6,32     SD R24,24(R6)  

98
Fig 21 The performance and the code size for the EEMBC consumer
benchmarks run on the Trimedia TM1300 and the NEC VR5000 and
shown relative to the performance and code size for the low-end NEC
VR4122.

99
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
• The Transmeta Crusoe processor
– A VLIW processor designed for low-power
– Instruction set compatibility with the x86
• Through a software system that translates from
x86 to the VLIW
– In-order execution
– Instruction size:
• 64 bits – 2 operations
• 128 bits – 4 operations

100
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
• The Crusoe 5 types of operation slots
– ALU: typical RISC ALU operations
– Compute: this slot may specify any integer ALU operation (there are 2 integer ALUs), an FP operation, or a multimedia operation
– Branch: a branch
– Immediate: a 32-bit immediate used by
another operation in the instruction
101
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
Fig 22 The energy performance of the processor and
memory interface modules using two multimedia
benchmarks is shown for the Mobile Pentium III and the
Transmeta 3200.

Power consumption for the workload (W)

Workload description   Mobile Pentium III @ 500 MHz, 1.6 V   TM3200 @ 400 MHz, 1.5 V   Relative consumption TM3200/Mobile Pentium III
MP3 playback           0.672                                 0.214                     0.32
DVD playback           1.13                                  0.479                     0.42

102
4.8 Another view: ILP in the embedded and mobile markets
Trimedia and Crusoe
Fig 23 Power distribution inside a laptop during DVD playback shows
that the processor subsystem consumes only 20% of the power!

Component                                 Power (W)   Percent of total system power
Low-power Pentium III                     0.8         8%
Processor interface/memory controller     0.65        6%
Memory                                    0.1         1%
Graphics                                  0.5         5%
Hard drive                                0.65        6%
DVD drive                                 2.51        24%
Audio                                     0.5         5%
Control and other                         1.3         12%
TFT display                               2.8         27%
Power supply                              0.72        7%
Total                                     10.43       100%
103
4.9 Fallacies and pitfalls
• Fallacy: There is a simple approach to
multiple-issue processors that yields high
performance without a significant
investment in silicon area or design
complexity.

104
