
Branch Prediction



Delayed Branch Technique


Delayed Branch Technique: From Before
Delayed Branch Technique: From Target
Delayed Branch Technique: From Fall-Through
Static Prediction
Does not take into account the run-time history of the
particular branch instruction: whether it was taken or not
taken recently, how often it was taken or not taken, etc.
Simplest static prediction:
predict always taken
predict always not taken
More complex static prediction:
performed at compile time by analyzing the program
Dynamic Hardware Branch Prediction
Dynamic prediction:
1-bit Predictor
The branch-prediction buffer, or branch history table (BHT), is a
cache indexed by a fixed lower portion of the address of the
branch instruction
1-bit prediction: for each index the BHT contains one prediction
bit (also called history bit) that says whether the branch was last
taken or not; the prediction is that the branch will do the same again

[Figure: a 1K-entry BHT holding 1 prediction bit per entry, accessed with a 10-bit index taken from the address bits (a31 a30 ... a11 ... a2 a1 a0) of the branch instruction in instruction memory]
Dynamic prediction: 1-bit Predictor
Meaning of prediction bit
1 = branch was last taken
0 = branch was last not taken
Using the BHT
index into the BHT and use the prediction bit to predict branch
behavior
note: the prediction bit may have been set by a different branch
instruction with the same lower address bits, but that does not matter:
the history bit is simply a hint
if prediction is wrong, invert prediction bit
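Example: Consider a loop branch that is taken 9 times in a row and
then not taken once. What is the prediction accuracy of a 1-bit predictor
for this branch, assuming only this branch ever changes its
corresponding prediction bit?
Answer: 80%, because there are two mispredictions per 10 executions: one
on the first iteration and one on the last iteration. Why? The
misprediction on the last iteration (the loop exit) flips the bit to not
taken, so the first iteration of the next run of the loop is mispredicted
too; the 8 iterations in between are predicted correctly.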
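The 80% figure can be checked with a small simulation. This is only a sketch: the function name `one_bit_accuracy` and the choice to replay the loop 100 times are assumptions made for illustration, not part of the slides.

```python
def one_bit_accuracy(outcomes, initial_bit=0):
    """Replay branch outcomes (True = taken) through one 1-bit predictor."""
    bit = initial_bit              # 1 = branch was last taken, 0 = last not taken
    correct = 0
    for taken in outcomes:
        predicted_taken = (bit == 1)
        if predicted_taken == taken:
            correct += 1
        else:
            bit ^= 1               # wrong prediction: invert the prediction bit
    return correct / len(outcomes)

# One run of the loop: taken 9 times in a row, then not taken once.
loop_run = [True] * 9 + [False]

# Replay many runs; after each loop exit the bit is left at "not taken".
accuracy = one_bit_accuracy(loop_run * 100, initial_bit=0)
print(f"{accuracy:.0%}")
```

Each run of the loop contributes exactly two mispredictions (first and last iteration), which is where the 8 out of 10 comes from.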
Dynamic prediction: 2-bit Predictor
2-bit prediction: for each index the BHT contains two prediction
bits that change as in the figure below
Key idea: the prediction must be wrong twice for it to be
changed
Example: What is the prediction accuracy of a 2-bit predictor on the
loop of the previous example?
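One way to work this example is to simulate it. The sketch below assumes the 2-bit scheme is a saturating counter (values 2 and 3 predict taken, 0 and 1 predict not taken); the state diagram on the slide may differ in its exact transitions, and the name `two_bit_accuracy` is hypothetical.

```python
def two_bit_accuracy(outcomes, counter=3):
    """Replay outcomes (True = taken) through one 2-bit saturating counter."""
    correct = 0
    for taken in outcomes:
        if (counter >= 2) == taken:      # counter values 2 and 3 predict taken
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

loop_run = [True] * 9 + [False]          # taken 9 times, then not taken once
accuracy = two_bit_accuracy(loop_run * 100)
print(f"{accuracy:.0%}")
```

Under that assumption only the loop exit is mispredicted: a single wrong prediction moves the counter from 3 to 2, which still predicts taken on the next first iteration.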
2-bit Predictor Statistics

Prediction accuracy of 4K-entry 2-bit prediction buffer on SPEC89 benchmarks:


accuracy is lower for integer programs (gcc, espresso, eqntott, li) than for FP
2-bit Predictor Statistics

Prediction accuracy of 4K-entry 2-bit prediction buffer vs. infinite 2-bit buffer:
increasing buffer size from 4K does not significantly improve performance
n-bit Predictors

Use an n-bit counter which, therefore, represents a value X
where $0 \le X \le 2^n - 1$
increment X if branch is taken (to a max of $2^n - 1$)
decrement X if branch is not taken (to a min of 0)
If $X \ge 2^{n-1}$, then predict taken; otherwise, untaken
Studies show that there is no significant improvement in
performance using n-bit predictors with n > 2, so 2-bit
predictors are implemented in most systems
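The counter rules above can be sketched as a small class. The names `NBitCounter` and `steady_state_accuracy`, the warm-up pass, and the reuse of the 9-taken/1-not-taken loop are assumptions for illustration; this is not a claim about how any particular processor implements its counters.

```python
class NBitCounter:
    """n-bit saturating counter holding X with 0 <= X <= 2**n - 1."""

    def __init__(self, n, value=0):
        self.max_value = (1 << n) - 1    # 2**n - 1
        self.threshold = 1 << (n - 1)    # predict taken when X >= 2**(n-1)
        self.value = value

    def predict(self):
        return self.value >= self.threshold

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


def steady_state_accuracy(n, loop_run, runs=50):
    counter = NBitCounter(n)
    for taken in loop_run:               # warm-up run, not counted
        counter.update(taken)
    correct = 0
    for taken in loop_run * runs:
        correct += (counter.predict() == taken)
        counter.update(taken)
    return correct / (len(loop_run) * runs)


loop_run = [True] * 9 + [False]
for n in (1, 2, 3):
    print(n, f"{steady_state_accuracy(n, loop_run):.0%}")
```

On this one loop, going from 1 bit to 2 bits removes a misprediction per run, while going from 2 bits to 3 bits changes nothing, which is in line with the observation about n > 2.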
Correlating Predictors
Code fragment from eqntott SPEC89 benchmark:

    if (aa == 2)
        aa = 0;
    if (bb == 2)
        bb = 0;
    if (aa != bb) {

Corresponding MIPS code (aa is in R1, bb is in R2):

        DSUBUI R3, R1, #2
        BNEZ   R3, L1        ; branch b1 (aa != 2)
        DADD   R1, R0, R0    ; aa = 0
    L1: DSUBUI R3, R2, #2
        BNEZ   R3, L2        ; branch b2 (bb != 2)
        DADD   R2, R0, R0    ; bb = 0
    L2: DSUB   R3, R1, R2    ; R3 = aa - bb
        BEQZ   R3, L3        ; branch b3 (aa == bb)

Key idea: branch b3 behavior is correlated with the behavior of
branches b1 and b2,
because if branches b1 and b2 are both not taken, then the
statements following the branches will set aa=0 and bb=0,
so b3 will be taken
Correlating Predictors:
Simple Example
Simple code fragment:

    if (d == 0)
        d = 1;
    if (d == 1) {

Corresponding MIPS code (d is in R1):

        BNEZ   R1, L1        ; branch b1 (d != 0)
        DADDIU R1, R0, #1    ; d==0, so d=1
    L1: DADDIU R3, R1, #-1
        BNEZ   R3, L2        ; branch b2 (d != 1)
        ...
    L2:

Initial value                      Value of d
of d            d==0?   b1         before b2    d==1?   b2
0               yes     not taken  1            yes     not taken
1               no      taken      1            yes     not taken
2               no      taken      2            no      taken

Possible execution sequences assuming d is one of 0, 1, or 2
Correlating Predictors:
Impact of Ignoring Correlation

d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT
2     NT              T           T                   NT              T           T
0     T               NT          NT                  T               NT          NT

Behavior of 1-bit predictor initialized to not taken with d alternating
between 2 and 0: 100% misprediction!
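The 100% misprediction result can be reproduced with a short simulation. This is a sketch: `run_fragment` and `simulate_one_bit` are hypothetical helper names, and each branch is assumed to own its prediction bit exclusively.

```python
def run_fragment(d):
    """Return (b1_taken, b2_taken) for one execution of the code fragment."""
    b1_taken = d != 0          # b1 is taken when d != 0
    if d == 0:
        d = 1
    b2_taken = d != 1          # b2 is taken when d != 1
    return b1_taken, b2_taken


def simulate_one_bit(d_values):
    bits = {"b1": False, "b2": False}    # False = predict not taken
    mispredictions = 0
    total = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), run_fragment(d)):
            if bits[name] != taken:
                mispredictions += 1
            bits[name] = taken           # 1-bit predictor remembers the last outcome
            total += 1
    return mispredictions, total


print(simulate_one_bit([2, 0, 2, 0]))
```

Because both branches flip direction every time d alternates, a predictor that only remembers the last outcome is always one step behind.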
Correlating Predictors:
Taking Correlation into Account
Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
NT/NT             NT                                    NT
NT/T              NT                                    T
T/NT              T                                     NT
T/T               T                                     T

Meaning of 1-bit predictor with 1 bit of correlation: equivalent to assuming two separate
prediction bits, one assuming the last branch executed was not taken and one assuming
the last branch executed was taken

d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
2     [NT]/NT         T           T/NT                NT/[NT]         T           NT/T
0     T/[NT]          NT          T/NT                [NT]/T          NT          NT/T
2     [T]/NT          T           T/NT                NT/[T]          T           NT/T
0     T/[NT]          NT          T/NT                [NT]/T          NT          NT/T

Behavior of 1-bit predictor with 1 bit of correlation, assuming initially NT/NT and d
alternating between 2 and 0: mispredictions only on first iteration.
Predictions used are shown in brackets.
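The same trace can be replayed in code. This is a sketch under two assumptions that the slide does not state explicitly: the "last branch" history starts out as not taken, and the history is shared by b1 and b2. The function name `simulate_1_1` is hypothetical.

```python
def simulate_1_1(d_values):
    # Each branch holds two 1-bit predictions: slot 0 is used when the last
    # executed branch was not taken, slot 1 when it was taken.
    table = {"b1": [False, False], "b2": [False, False]}   # initially NT/NT
    last_taken = False          # assumption: initial history is "not taken"
    mispredicted = []
    for d in d_values:
        b1_taken = d != 0
        if d == 0:
            d = 1
        b2_taken = d != 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            slot = int(last_taken)
            mispredicted.append(table[name][slot] != taken)
            table[name][slot] = taken    # update only the slot that was used
            last_taken = taken           # this branch becomes the "last branch"
    return mispredicted


print(simulate_1_1([2, 0, 2, 0]))
```

Only the two predictions made for the first value of d come out wrong; once each slot has seen its context, the alternating pattern is predicted correctly.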
Correlating Predictors:
(m,n) Predictors
The correlating predictor as before (1 bit of prediction plus 1
correlating bit) is called a (1,1) predictor
Generalization of the (1,1) predictor is the (m,n) predictor
(m,n) predictor: use the behavior of the last m branches to
choose from one of $2^m$ branch predictors, each of which is an
n-bit predictor
The history of the most recent m branches is recorded in an
m-bit shift register called the m-bit global history register:
shift in the behavior bit for the most recent branch, shift out
the bit for the least recent branch
Index into the BHT by concatenating the lower bits of the
branch instruction address with the m-bit global history to
access an n-bit entry
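A minimal sketch of an (m,n) predictor follows. The class name `CorrelatingPredictor`, the `index_bits` parameter, the branch addresses, and the choice to drop the two lowest PC bits (word-aligned instructions) are all assumptions for illustration; the n-bit entries are modeled as saturating counters.

```python
class CorrelatingPredictor:
    """(m,n) predictor: m bits of global history select among n-bit counters."""

    def __init__(self, m, n, index_bits):
        self.m, self.n, self.index_bits = m, n, index_bits
        self.history = 0                                  # m-bit global history register
        self.table = [0] * (1 << (index_bits + m))        # one n-bit counter per entry

    def _index(self, pc):
        # Assumption: instructions are word aligned, so skip the two lowest PC bits.
        low = (pc >> 2) & ((1 << self.index_bits) - 1)
        return (low << self.m) | self.history             # address bits ++ history

    def predict(self, pc):
        return self.table[self._index(pc)] >= (1 << (self.n - 1))

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, (1 << self.n) - 1)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
        # Shift in the newest outcome, shift out the oldest.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


# Usage: a (1,1) predictor on the fragment with d alternating between 2 and 0.
B1_PC, B2_PC = 0x100, 0x10C          # hypothetical branch addresses
predictor = CorrelatingPredictor(m=1, n=1, index_bits=4)
misses = 0
for d in [2, 0, 2, 0]:
    b1_taken = d != 0
    if d == 0:
        d = 1
    b2_taken = d != 1
    for pc, taken in ((B1_PC, b1_taken), (B2_PC, b2_taken)):
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
print(misses)
```

Setting m=0 reduces the structure to a plain BHT of n-bit counters, since the history register then contributes no index bits.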
(2, 2) Correlating Branch Predictors
Example of (2, 2) Correlating Predictor
Accuracy of Correlating Predictors
Simple Example
Note: A 2-bit predictor with no global history is simply a (0,2)
predictor

How many bits are in a (0,2) predictor with 4K entries?
Total bits = $2^0 \times 2 \times 4\text{K} = 8\text{K}$
How many bits are in a (2,2) predictor with 16 entries (shown in
previous figure)?
Total bits = $64 \times 2 = 128$
How many branch-selected entries are in a (2,2) predictor that has a
total of 8K bits in the BHT? How many bits of the branch address are
used to access the BHT?
Total bits = $8\text{K} = 2^2 \times 2 \times$ (no. of prediction entries selected by branch)
Therefore, no. of prediction entries selected by branch = 1K
Therefore, no. of bits of branch address used to index BHT = 10
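The three calculations follow one pattern: total bits = $2^m \times n \times$ (entries selected by the branch address). A small sketch, with the helper name `bht_bits` being an assumption for illustration:

```python
def bht_bits(m, n, entries):
    """Total storage of an (m,n) predictor with `entries` branch-selected entries."""
    return (2 ** m) * n * entries


print(bht_bits(0, 2, 4 * 1024))      # (0,2) predictor with 4K entries
print(bht_bits(2, 2, 16))            # (2,2) predictor with 16 entries

# Inverse question: a (2,2) predictor with 8K bits in total.
total_bits = 8 * 1024
entries = total_bits // (2 ** 2 * 2)
index_bits = entries.bit_length() - 1    # log2 of a power of two
print(entries, index_bits)
```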
Tournament Predictors
Motivation for correlating branch predictors:
2-bit local predictor failed on important
branches; by adding global information,
performance improved
Tournament predictors: use two predictors, 1
based on global information and 1 based on
local information, and combine with a selector
Hopes to select right predictor for right
branch (or right context of branch)
Tournament Predictor in Alpha 21264
4K 2-bit counters to choose from among a global
predictor and a local predictor
Global predictor also has 4K entries and is indexed by
the history of the last 12 branches; each entry in the
global predictor is a standard 2-bit predictor
12-bit pattern: ith bit is 0 => ith prior branch not taken;
ith bit is 1 => ith prior branch taken;
Here c1/c2 means: correctness of predictor 1 /
correctness of predictor 2
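[Figure: four-state selector diagram with two "Use 1" states and two "Use 2" states; transitions are labeled with c1/c2 values (00, 01, 10, 11). Shown next to a table of 4K 2-bit counters accessed with a 12-bit index.]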
Tournament Predictor in Alpha 21264
Local predictor consists of a 2-level predictor:
Top level: a local history table consisting of 1024 10-bit
entries; each 10-bit entry corresponds to the most recent 10
branch outcomes for the entry. 10-bit history allows patterns
of up to 10 branches to be discovered and predicted. Indexed by
local branch address.
Next level: the selected entry from the local history table is used
to index a table of 1K entries consisting of 3-bit saturating
counters, which provide the local prediction
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits!
(~180K transistors)
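The table organization described above can be sketched as follows. This is a simplified model, not the actual Alpha 21264 logic: the class name, the initial counter values, indexing the selector by the global history, and dropping the two lowest PC bits are all assumptions made for illustration.

```python
class TournamentPredictor:
    def __init__(self):
        self.global_history = 0              # outcomes of the last 12 branches
        self.global_counters = [1] * 4096    # 4K 2-bit counters
        self.chooser = [1] * 4096            # 4K 2-bit counters; >= 2 selects global
        self.local_history = [0] * 1024      # 1K 10-bit local histories
        self.local_counters = [3] * 1024     # 1K 3-bit saturating counters

    def _lookup(self, pc):
        lh_index = (pc >> 2) & 1023          # assumption: word-aligned PCs
        local_index = self.local_history[lh_index]
        local_pred = self.local_counters[local_index] >= 4
        global_pred = self.global_counters[self.global_history] >= 2
        return lh_index, local_index, local_pred, global_pred

    def predict(self, pc):
        _, _, local_pred, global_pred = self._lookup(pc)
        use_global = self.chooser[self.global_history] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        lh_index, local_index, local_pred, global_pred = self._lookup(pc)
        gh = self.global_history
        # The selector moves only when exactly one predictor was correct.
        if global_pred == taken and local_pred != taken:
            self.chooser[gh] = min(self.chooser[gh] + 1, 3)
        elif local_pred == taken and global_pred != taken:
            self.chooser[gh] = max(self.chooser[gh] - 1, 0)
        step = 1 if taken else -1
        self.global_counters[gh] = min(max(self.global_counters[gh] + step, 0), 3)
        self.local_counters[local_index] = min(
            max(self.local_counters[local_index] + step, 0), 7)
        self.local_history[lh_index] = (
            (self.local_history[lh_index] << 1) | int(taken)) & 1023
        self.global_history = ((gh << 1) | int(taken)) & 4095


# Usage: one branch at a hypothetical address repeating taken, taken, not taken.
predictor = TournamentPredictor()
pattern = [True, True, False]
for i in range(300):                         # warm-up
    predictor.update(0x400, pattern[i % 3])
results = []
for i in range(300, 330):
    taken = pattern[i % 3]
    results.append(predictor.predict(0x400) == taken)
    predictor.update(0x400, taken)
print(all(results))

total_bits = 4096 * 2 + 4096 * 2 + 1024 * 10 + 1024 * 3
print(total_bits // 1024, "Kbits")
```

The selector update mirrors the c1/c2 labels: when both predictors are right or both are wrong, the selector state stays where it is.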
% of predictions from local predictor
in Tournament Prediction Scheme

Benchmark    % from local predictor
nasa7        98%
matrix300    100%
tomcatv      94%
doduc        90%
spice        55%
fpppp        76%
gcc          72%
espresso     63%
eqntott      37%
li           69%
Accuracy of Branch Prediction

Benchmark    Profile-based   2-bit counter   Tournament
tomcatv      99%             99%             100%
doduc        95%             84%             97%
fpppp        86%             82%             98%
li           88%             77%             98%
espresso     86%             82%             96%
gcc          88%             70%             94%

Profile: branch profile from last execution
Accuracy v. Size (SPEC89)
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for three schemes: local 2-bit counters, correlating (2,2) scheme, and tournament]

Need Address
at Same Time as Prediction
Branch Target Buffer (BTB): Address of branch used as index to
get prediction AND branch target address (if taken)
Note: must check for branch match now, since can't use wrong branch
address

[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC field of the BTB; each entry also holds the Predicted PC and prediction state bits.
Yes: instruction is branch; use predicted PC as next PC (if predict Taken)
No: branch not predicted; proceed normally (PC+4)]
Branch Target Cache
Branch Target Cache: only predicted taken branches
Cache: Content Addressable Memory (CAM) or Associative
Memory (see figure)
Use a big Branch History Table & a small Branch Target Cache

[Figure: the PC is compared (=?) against the Branch PC field of each entry; each entry also holds the Predicted PC and optional prediction state bits.
Yes: branch found, predicted taken
No: not found]
Steps with Branch Target Buffer
for the 5-stage MIPS

IF: Send PC to memory and branch-target buffer. Entry found in branch-target buffer?
  No (entry not found):
    ID: Is instruction a taken branch?
      No: Normal instruction execution
      Yes: EX: Enter branch instruction address and next PC into branch-target buffer
  Yes (entry found):
    ID: Send out predicted PC
    EX: Taken branch?
      No: Mispredicted branch, kill fetched instruction; restart fetch at other target; delete entry from target buffer
      Yes: Branch correctly predicted; continue execution with no stalls
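The decision flow above can be sketched in a few lines. This is a behavioral model only: the class and method names, the result strings, the unbounded dictionary standing in for the associative lookup, and the fixed 4-byte instruction size are all assumptions for illustration.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}                # branch PC -> predicted target PC

    def fetch(self, pc):
        """IF stage: look up the PC; return (hit, next_pc)."""
        if pc in self.entries:
            return True, self.entries[pc]    # send out predicted PC
        return False, pc + 4                 # no entry: proceed normally

    def resolve(self, pc, hit, taken, target):
        """Once the outcome is known: update the buffer, return (result, next_pc)."""
        if hit and taken:
            return "correctly predicted", target
        if hit and not taken:
            del self.entries[pc]             # delete entry from target buffer
            return "mispredicted", pc + 4    # restart fetch at the other target
        if taken:
            self.entries[pc] = target        # enter branch address and next PC
            return "entered into buffer", target
        return "normal execution", pc + 4


# Usage with a hypothetical taken branch at 0x100 jumping to 0x200.
btb = BranchTargetBuffer()
hit, next_pc = btb.fetch(0x100)              # first encounter: no entry yet
print(btb.resolve(0x100, hit, True, 0x200))
hit, next_pc = btb.fetch(0x100)              # second encounter: entry found
print(hit, hex(next_pc))
```

Only taken branches are ever entered, so a hit by itself acts as a "predict taken" signal in this sketch.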
