8 - Branch Prediction

Branch Prediction
Static Prediction
• Does not take into account the run-time history of the
particular branch instruction – whether it was taken or not
taken recently, how often it was taken or not taken, etc.
• Simplest static prediction:
  • predict always taken
  • predict always not taken
• More complex static prediction:
  • performed at compile time by analyzing the program…
Dynamic Hardware Branch Prediction
Dynamic prediction: 1-bit Predictor
• Branch-prediction buffer or branch history table (BHT) is a
cache indexed by a fixed lower portion of the address of the
branch instruction
• 1-bit prediction: for each index the BHT contains one prediction
bit (also called history bit) that says whether the branch was last taken
or not – the prediction is that the branch will do the same again
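[Figure: a 10-bit index taken from the low-order bits of the branch instruction address a31a30…a11…a2a1a0 (fetched from instruction memory) selects one entry of a 1K-entry BHT; each entry holds 1 prediction bit.]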
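The indexing described above can be sketched in a few lines. This is a minimal sketch, assuming 4-byte-aligned instructions so that bits a1a0 are skipped and bits a11…a2 form the 10-bit index; the function name `bht_index` and the sample addresses are illustrative and do not come from the slides.

```python
BHT_ENTRIES = 1024  # 1K-entry BHT, so a 10-bit index


def bht_index(pc: int) -> int:
    """Return the BHT index for a branch at byte address pc.

    Assumption: instructions are 4-byte aligned, so the two lowest
    address bits carry no information and are dropped.
    """
    return (pc >> 2) & (BHT_ENTRIES - 1)


# Two branches whose addresses differ only in the upper bits
# map to the same entry and therefore share a prediction bit.
a = bht_index(0x00400010)
b = bht_index(0x00800010)
print(a, b, a == b)
```

Because the table has no tags, the two hypothetical branches above simply share one prediction bit.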
Dynamic prediction: 1-bit Predictor
• Meaning of prediction bit
  • 1 = branch was last taken
  • 0 = branch was last not taken
• Using the BHT
  • index into the BHT and use the prediction bit to predict branch
behavior
  • note the prediction bit may have been set by a different branch
instruction with the same lower address bits, but that does not matter –
the history bit is simply a hint
  • if prediction is wrong, invert prediction bit
• Example: Consider a loop branch that is taken 9 times in a row and
then not taken once. What is the prediction accuracy of a 1-bit predictor
for this branch, assuming only this branch ever changes its
corresponding prediction bit?
• Answer: 80%, because there are two mispredictions in every 10
executions of the branch – one on the first iteration and one on the
last iteration. Why?
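The 80% figure can be checked with a short simulation. This is a sketch under the example's own assumption that no other branch touches the bit; the helper name `one_bit_accuracy` is illustrative. The bit starts at "not taken" because that is the state the previous exit from the loop leaves behind, which is why the first iteration of each new pass is mispredicted as well as the last.

```python
def one_bit_accuracy(outcomes, initial=False):
    """Simulate a 1-bit predictor over a list of booleans (True = taken)."""
    bit = initial
    correct = 0
    for taken in outcomes:
        if bit == taken:
            correct += 1
        bit = taken  # the bit always remembers the most recent outcome
    return correct / len(outcomes)


loop = [True] * 9 + [False]  # taken 9 times, then not taken once
print(one_bit_accuracy(loop * 100, initial=False))
```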
Dynamic prediction: 2-bit Predictor
• 2-bit prediction: for each index the BHT contains two prediction
bits that change as in the figure below
• Key idea: the prediction must be wrong twice for it to be
changed
• Example: What is the prediction accuracy of a 2-bit predictor on the
loop of the previous example?
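One way to explore the question is to simulate it. The sketch below assumes the two bits behave as a saturating counter (0 to 3, predict taken when the value is 2 or 3). That is one common realisation of a 2-bit scheme and may differ in detail from the state diagram on the original slide. The function name and starting state are illustrative.

```python
def two_bit_accuracy(outcomes, counter=3):
    """Simulate one 2-bit saturating counter; predict taken when counter >= 2."""
    correct = 0
    for taken in outcomes:
        if (counter >= 2) == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


loop = [True] * 9 + [False]  # taken 9 times, then not taken once
print(two_bit_accuracy(loop * 100))             # counter already warmed up
print(two_bit_accuracy(loop * 100, counter=0))  # cold start
```

Under this assumption only the loop exit is mispredicted once the counter has warmed up, because a single not-taken outcome is not enough to flip the prediction.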
2-bit Predictor Statistics
Prediction accuracy of 4K-entry 2-bit prediction buffer on SPEC89 benchmarks:

accuracy is lower for integer programs (gcc, espresso, eqntott, li) than for FP
2-bit Predictor Statistics
Prediction accuracy of 4K-entry 2-bit prediction buffer vs. “infinite” 2-bit buffer:
increasing buffer size from 4K does not significantly improve performance
n-bit Predictors
• Use an n-bit counter which, therefore, represents a value X
where $0 \le X \le 2^n - 1$
  • increment X if branch is taken (to a max of $2^n - 1$)
  • decrement X if branch is not taken (to a min of 0)
• If $X \ge 2^{n-1}$, then predict taken; otherwise, predict untaken
• Studies show that there is no significant improvement in
performance using n-bit predictors with n > 2, so 2-bit
predictors are implemented in most systems
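The counter rules above translate directly into code. The following is a minimal sketch; the class name `SaturatingCounter` and its method names are illustrative.

```python
class SaturatingCounter:
    """n-bit saturating counter holding X with 0 <= X <= 2**n - 1."""

    def __init__(self, n_bits, value=0):
        self.max_value = (1 << n_bits) - 1   # 2**n - 1
        self.threshold = 1 << (n_bits - 1)   # 2**(n-1)
        self.value = value

    def predict(self):
        """Predict taken when X >= 2**(n-1)."""
        return self.value >= self.threshold

    def update(self, taken):
        """Increment on taken, decrement on not taken, saturating at both ends."""
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


counter = SaturatingCounter(n_bits=3)
for _ in range(10):
    counter.update(True)
print(counter.value, counter.predict())
```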
Correlating Predictors
Code fragment from eqntott SPEC89 benchmark:

    if (aa == 2)
        aa = 0;
    if (bb == 2)
        bb = 0;
    if (aa != bb) {…

Corresponding MIPS code (aa is in R1, bb is in R2):

        DSUBUI R3, R1, #2
        BNEZ   R3, L1       ; branch b1 (aa != 2)
        DADD   R1, R0, R0   ; aa = 0
    L1: DSUBUI R3, R2, #2
        BNEZ   R3, L2       ; branch b2 (bb != 2)
        DADD   R2, R0, R0   ; bb = 0
    L2: DSUB   R3, R1, R2   ; R3 = aa – bb
        BEQZ   R3, L3       ; branch b3 (aa == bb)

• Key idea: branch b3 behavior is correlated with the behavior of
branches b1 and b2
  • because if branches b1 and b2 are both not taken, then the
statements following the branches will set aa=0 and bb=0
  • ⇒ b3 will be taken
Correlating Predictors: Simple Example

Simple code fragment:

    if (d == 0)
        d = 1;
    if (d == 1) {
    …

Corresponding MIPS code (d is in R1):

        BNEZ   R1, L1        ; branch b1 (d != 0)
        DADDIU R1, R0, #1    ; d==0, so d=1
    L1: DADDIU R3, R1, #-1
        BNEZ   R3, L2        ; branch b2 (d != 1)
        …
    L2:

Initial value of d | d==0? | b1        | Value of d before b2 | d==1? | b2
0                  | yes   | not taken | 1                    | yes   | not taken
1                  | no    | taken     | 1                    | yes   | not taken
2                  | no    | taken     | 2                    | no    | taken

Possible execution sequences assuming d is one of 0, 1, or 2
Impact of Ignoring Correlation
d = ? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
2     | NT            | T         | T                 | NT            | T         | T
0     | T             | NT        | NT                | T             | NT        | NT
2     | NT            | T         | T                 | NT            | T         | T
0     | T             | NT        | NT                | T             | NT        | NT

Behavior of 1-bit predictor initialized to not taken with d alternating
between 2 and 0: 100% misprediction!
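The table can be reproduced by simulating the code fragment with one independent prediction bit per branch. This is a sketch; `run_fragment` is an illustrative helper that mirrors the C fragment, and the dictionary keys `b1` and `b2` stand for the two branches.

```python
def run_fragment(d):
    """Return (b1_taken, b2_taken) for the fragment, given the initial d."""
    b1_taken = d != 0      # b1 is BNEZ on d, taken when d != 0
    if not b1_taken:
        d = 1              # fall-through path sets d = 1
    b2_taken = d != 1      # b2 is taken when d != 1
    return b1_taken, b2_taken


bits = {"b1": False, "b2": False}  # both predictors initialized to not taken
mispredictions = total = 0
for d in [2, 0, 2, 0]:
    for name, taken in zip(("b1", "b2"), run_fragment(d)):
        total += 1
        mispredictions += bits[name] != taken
        bits[name] = taken  # 1-bit predictor: remember the last outcome
print(mispredictions, "mispredictions out of", total)
```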
Taking Correlation into Account
Prediction bits | Prediction if last branch not taken | Prediction if last branch taken
NT/NT           | NT                                  | NT
NT/T            | NT                                  | T
T/NT            | T                                   | NT
T/T             | T                                   | T

Meaning of 1-bit predictor with 1 bit of correlation: equivalent to assuming two separate
prediction bits – one assuming the last branch executed was not taken and one assuming
the last branch executed was taken
d = ? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
2     | NT/NT         | T         | T/NT              | NT/NT         | T         | NT/T
0     | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T
2     | T/NT          | T         | T/NT              | NT/T          | T         | NT/T
0     | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T

Behavior of 1-bit predictor with 1 bit of correlation, assuming initially NT/NT and d
alternating between 2 and 0: mispredictions only on the first iteration.
In each pair, the prediction actually used is the one selected by the outcome of the
last branch executed (shown in red on the original slide).
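The same experiment with a (1,1) predictor can be simulated as follows. This is a sketch: each branch keeps a pair of bits, index 0 is used when the last executed branch was not taken and index 1 when it was taken, and only the bit that was used is updated. The initial "last branch" outcome is assumed to be not taken, which does not affect the result because both bits start as NT.

```python
def run_fragment(d):
    """Return (b1_taken, b2_taken) for the fragment, given the initial d."""
    b1_taken = d != 0
    if not b1_taken:
        d = 1
    b2_taken = d != 1
    return b1_taken, b2_taken


# table[branch] = [prediction if last branch NT, prediction if last branch T]
table = {"b1": [False, False], "b2": [False, False]}
last_taken = False  # assumed outcome of the branch executed before the first b1
misses_per_iteration = []
for d in [2, 0, 2, 0]:
    misses = 0
    for name, taken in zip(("b1", "b2"), run_fragment(d)):
        selected = int(last_taken)        # one bit of global history picks the entry
        misses += table[name][selected] != taken
        table[name][selected] = taken     # update only the bit that was used
        last_taken = taken
    misses_per_iteration.append(misses)
print(misses_per_iteration)
```

The final contents of `table` correspond to the T/NT and NT/T entries in the last rows above.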
(m,n) Predictors
• The correlating predictor as before – 1 bit of prediction plus 1
correlating bit – is called a (1,1) predictor
• Generalization of the (1,1) predictor is the (m,n) predictor
• (m,n) predictor: use the behavior of the last m branches to
choose from one of $2^m$ branch predictors, each of which is an
n-bit predictor
• The history of the most recent m branches is recorded in an
m-bit shift register called the m-bit global history register
  • shift in the behavior bit for the most recent branch, shift out
the bit for the least recent branch
• Index into the BHT by concatenating the lower bits of the
branch instruction address with the m-bit global history to
access an n-bit entry
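The indexing and history update can be sketched as below. This is a minimal sketch: the class name `MNPredictor`, the parameter `addr_bits` (the number of low-order address bits used), the 4-byte instruction alignment, and the use of saturating counters for the n-bit entries are all assumptions made for illustration.

```python
class MNPredictor:
    """(m,n) predictor: m bits of global history select among n-bit counters."""

    def __init__(self, m, n, addr_bits):
        self.m, self.n, self.addr_bits = m, n, addr_bits
        self.history = 0  # m-bit global history register, newest outcome in bit 0
        self.counters = [0] * (1 << (addr_bits + m))

    def _index(self, pc):
        low = (pc >> 2) & ((1 << self.addr_bits) - 1)
        return (low << self.m) | self.history  # address bits concatenated with history

    def predict(self, pc):
        return self.counters[self._index(pc)] >= (1 << (self.n - 1))

    def update(self, pc, taken):
        i = self._index(pc)
        top = (1 << self.n) - 1
        self.counters[i] = min(self.counters[i] + 1, top) if taken else max(self.counters[i] - 1, 0)
        # Shift in the newest outcome; the oldest bit falls off the top.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    def size_bits(self):
        return len(self.counters) * self.n


predictor = MNPredictor(m=2, n=2, addr_bits=4)
hits = []
for step in range(20):
    taken = step % 2 == 0  # a branch that strictly alternates T, NT, T, NT, ...
    hits.append(predictor.predict(0x40) == taken)
    predictor.update(0x40, taken)
print(hits[-10:].count(False), predictor.size_bits())
```

An alternating branch defeats a plain 2-bit counter, but with two bits of history each outcome pattern gets its own counter, so the predictions settle after a short warm-up.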
(2, 2) Correlating Branch Predictors
Example of (2, 2) Correlating Predictor
Accuracy of Correlating Predictors
Tournament Predictors
• Motivation for correlating branch predictors:
2-bit local predictor failed on important
branches; by adding global information,
performance improved
• Tournament predictors: use two predictors, one
based on global information and one based on
local information, and combine them with a selector
• Hopes to select the right predictor for the right
branch (or right context of branch)
Tournament Predictor in Alpha 21264
• 4K 2-bit counters to choose from among a global
predictor and a local predictor
• Global predictor also has 4K entries and is indexed by
the history of the last 12 branches; each entry in the
global predictor is a standard 2-bit predictor
  • 12-bit pattern: ith bit is 0 => ith prior branch not taken;
ith bit is 1 => ith prior branch taken
• Here c1/c2 means: correctness of predictor 1 /
correctness of predictor 2
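[Figure: state diagram of the 2-bit selector counter (4K × 2 bits, indexed by 12 bits). The four states are labeled "Use 1" or "Use 2"; transitions are labeled with c1/c2 values. A state moves toward "Use 1" on 10 and toward "Use 2" on 01, and stays where it is on 00 and 11.]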
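The selector's behavior can be sketched as a 2-bit saturating counter driven by the c1/c2 pair. This is a minimal sketch: the numeric state encoding (0 and 1 select predictor 1, 2 and 3 select predictor 2) and the function names are assumptions for illustration, not taken from the Alpha 21264 documentation.

```python
def update_selector(state, c1, c2):
    """Advance the 2-bit selector given whether each predictor was correct.

    state 0 or 1: use predictor 1; state 2 or 3: use predictor 2.
    """
    if c1 and not c2:
        return max(state - 1, 0)   # c1/c2 = 10: move toward predictor 1
    if c2 and not c1:
        return min(state + 1, 3)   # c1/c2 = 01: move toward predictor 2
    return state                   # 00 or 11: no information, stay put


def choose(state):
    """Return which predictor (1 or 2) the selector currently trusts."""
    return 1 if state < 2 else 2


state = 0
for c1, c2 in [(False, True), (True, True), (False, True), (False, False)]:
    state = update_selector(state, c1, c2)
print(state, choose(state))
```

The selector changes its choice only after one predictor has been right twice while the other was wrong, which mirrors the hysteresis of a 2-bit prediction counter.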
Tournament Predictor in Alpha 21264
• Local predictor consists of a 2-level predictor:
  • Top level: a local history table consisting of 1024 10-bit
entries; each 10-bit entry records the most recent 10
branch outcomes for the entry. A 10-bit history allows patterns
of up to 10 branches to be discovered and predicted. Indexed by local
branch address.
  • Next level: the selected entry from the local history table is used
to index a table of 1K entries consisting of 3-bit saturating
counters, which provide the local prediction
• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits!
(~180K transistors)
[Figure: local history table (1K × 10 bits) feeding the local prediction table (1K × 3 bits).]
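The two-level lookup and the size arithmetic can be sketched as follows. This is a minimal sketch: the class name `LocalPredictor`, the 4-byte instruction alignment used for indexing, and the "predict taken when the 3-bit counter is 4 or more" threshold are assumptions for illustration, not details confirmed by the slides.

```python
K = 1024


class LocalPredictor:
    """Two-level local predictor: per-branch history, then a counter per history."""

    def __init__(self):
        self.histories = [0] * K  # 1024 entries, 10 bits of history each
        self.counters = [0] * K   # 1K entries, 3-bit saturating counters (0..7)

    def predict(self, pc):
        history = self.histories[(pc >> 2) % K]
        return self.counters[history] >= 4  # assumed threshold: upper half means taken

    def update(self, pc, taken):
        slot = (pc >> 2) % K
        history = self.histories[slot]
        count = self.counters[history]
        self.counters[history] = min(count + 1, 7) if taken else max(count - 1, 0)
        # Shift the new outcome into the 10-bit history, dropping the oldest bit.
        self.histories[slot] = ((history << 1) | int(taken)) % K


predictor = LocalPredictor()
results = []
for taken in ([True] * 9 + [False]) * 10:  # loop branch: 9 taken, then 1 not taken
    results.append(predictor.predict(0x400) == taken)
    predictor.update(0x400, taken)
print(sum(results[-30:]), "of the last 30 predictions correct")

total_bits = 4 * K * 2 + 4 * K * 2 + K * 10 + K * 3
print(total_bits, "bits =", total_bits // K, "K bits")
```

Because the 10-bit history covers the whole 10-outcome pattern, the loop exit becomes predictable once the counters have warmed up.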
% of predictions from local predictor in Tournament Prediction Scheme

Benchmark | Predictions from local predictor
nasa7     | 98%
matrix300 | 100%
tomcatv   | 94%
doduc     | 90%
spice     | 55%
fpppp     | 76%
gcc       | 72%
espresso  | 63%
eqntott   | 37%
li        | 69%
Accuracy of Branch Prediction

Benchmark | Profile-based | 2-bit counter | Tournament
tomcatv   | 99%           | 99%           | 100%
doduc     | 95%           | 84%           | 97%
fpppp     | 86%           | 82%           | 98%
li        | 88%           | 77%           | 98%
espresso  | 86%           | 82%           | 96%
gcc       | 88%           | 70%           | 94%

• Profile: branch profile from last execution
Accuracy v. Size (SPEC89)
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for three schemes: local 2-bit counters, correlating (2,2) scheme, and tournament.]

Need Address at Same Time as Prediction
• Branch Target Buffer (BTB): address of branch is used as index to
get the prediction AND the branch target address (if taken)
  • Note: must check for a branch match now, since we can’t use the
wrong branch’s target address
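[Figure: during FETCH, the PC of the instruction is compared (=?) against the Branch PC field of the BTB; each entry holds Branch PC, Predicted PC, and prediction state bits. Yes: instruction is a branch; use the predicted PC as the next PC (if predicted taken). No: branch not predicted; proceed normally (PC+4).]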
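The lookup in the figure can be sketched as below. This is a minimal sketch: the class name `BranchTargetBuffer`, the direct-mapped organization, the 16-entry size, and the single taken/not-taken state bit are assumptions chosen to keep the example small.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each slot holds (branch_pc, predicted_pc, predict_taken)."""

    def __init__(self, entries=16):
        self.size = entries
        self.table = [None] * entries

    def _slot(self, pc):
        return (pc >> 2) % self.size  # low-order address bits pick the slot

    def next_pc(self, pc):
        """Return the predicted next fetch address for the instruction at pc."""
        entry = self.table[self._slot(pc)]
        # The stored branch PC must match: a hit on a different branch
        # would supply the wrong target address.
        if entry is not None and entry[0] == pc and entry[2]:
            return entry[1]
        return pc + 4  # not a known taken branch: proceed normally

    def update(self, pc, taken, target):
        """Record the resolved outcome and target of the branch at pc."""
        self.table[self._slot(pc)] = (pc, target, taken)


btb = BranchTargetBuffer()
btb.update(0x100, True, 0x200)
print(hex(btb.next_pc(0x100)), hex(btb.next_pc(0x104)), hex(btb.next_pc(0x140)))
```

Address 0x140 maps to the same slot as 0x100 in this sketch, and the match check is what keeps it from being redirected to 0x200.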
Branch Target “Cache”
• Branch target cache: holds only predicted-taken branches
• “Cache”: Content Addressable Memory (CAM) or Associative
Memory (see figure)
• Use a big Branch History Table and a small Branch Target Cache
[Figure: the PC is compared (=?) against the Branch PC field of each entry; entries hold Branch PC, Predicted PC, and optional prediction state bits. Yes: branch found, predicted taken. No: not found.]
