8 - Branch Prediction
Static Prediction
Does not take into account the run-time history of the
particular branch instruction – whether it was taken or not
taken recently, how often it was taken or not taken, etc.
Simplest static prediction:
predict always taken
predict always not taken
More complex static prediction:
performed at compile time by analyzing the program…
Dynamic Hardware Branch Prediction
Dynamic prediction: 1-bit Predictor
Branch-prediction buffer, or branch history table (BHT), is a
cache indexed by a fixed lower portion of the address of the
branch instruction
1-bit prediction: for each index the BHT contains one prediction
bit (also called history bit) that says if the branch was last taken
or not – prediction is that branch will do the same again
[Figure: a 1K-entry BHT holding 1 prediction bit (0 or 1) per entry, selected by a 10-bit index taken from the branch instruction address a31 a30 … a11 … a2 a1 a0 in instruction memory]
Dynamic prediction: 1-bit Predictor
Meaning of prediction bit
1 = branch was last taken
0 = branch was last not taken
Using the BHT
index into the BHT and use the prediction bit to predict branch
behavior
note the prediction bit may have been set by a different branch
instruction with the same lower address bits but that does not matter –
the history bit is simply a hint
if prediction is wrong, invert prediction bit
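The lookup and update steps above can be sketched as follows. This is a minimal illustration, not a hardware description: the names `bht_index` and `OneBitBHT` are hypothetical, the choice of address bits a11..a2 for the index assumes 4-byte aligned instructions, and the initial value of every bit (0) is also an assumption.

```python
INDEX_BITS = 10                      # 1K-entry BHT, as in the slides
BHT_SIZE = 1 << INDEX_BITS


def bht_index(pc):
    # Assumption: instructions are 4-byte aligned, so bits a11..a2 form the index.
    return (pc >> 2) & (BHT_SIZE - 1)


class OneBitBHT:
    def __init__(self):
        # 0 = branch was last not taken, 1 = branch was last taken
        self.bits = [0] * BHT_SIZE

    def predict(self, pc):
        return self.bits[bht_index(pc)] == 1

    def update(self, pc, taken):
        i = bht_index(pc)
        if (self.bits[i] == 1) != taken:
            self.bits[i] ^= 1        # prediction was wrong: invert the bit


bht = OneBitBHT()
bht.update(0x1004, True)             # a taken branch at 0x1004 sets its entry
print(bht.predict(0x2004))           # a different branch sharing the same index sees the hint
```

The two addresses differ only in bits above a11, so they share one entry; the second branch simply inherits the first branch's history bit as a hint.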
Example: Consider a loop branch that is taken 9 times in a row and
then not taken once. What is the prediction accuracy of 1-bit predictor
for this branch assuming only this branch ever changes its
corresponding prediction bit?
Answer: 80%. There are two mispredictions in every 10 executions of the
branch: one on the first iteration and one on the last iteration. Why?
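One way to check this answer is to replay the branch outcomes through a single prediction bit. The sketch below assumes the bit starts at 0 (last not taken), as it would after a previous exit from the loop; the function name `one_bit_accuracy` is hypothetical.

```python
def one_bit_accuracy(outcomes, bit=0):
    """Replay outcomes (1 = taken, 0 = not taken) through one prediction bit."""
    correct = 0
    for taken in outcomes:
        if bit == taken:
            correct += 1
        else:
            bit = taken              # misprediction: invert the prediction bit
    return correct / len(outcomes)


# Loop branch: taken 9 times, then not taken once, repeated 100 times.
loop = ([1] * 9 + [0]) * 100
print(one_bit_accuracy(loop))        # 0.8
```

The exit from the loop flips the bit to "not taken", so the first iteration of the next run of the loop is mispredicted as well.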
Dynamic prediction: 2-bit Predictor
2-bit prediction: for each index the BHT contains two prediction
bits that change as in the figure below
Key idea: the prediction must be wrong twice for it to be
changed
Example: What is the prediction accuracy of a 2-bit predictor on the
loop of the previous example?
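A minimal sketch for working this example, assuming the common 2-bit saturating-counter scheme (values 0 to 3, predict taken when the counter is 2 or 3). The state diagram is not reproduced in this text, so the exact encoding and the starting state (3, strongly taken) are assumptions, and `two_bit_accuracy` is a hypothetical name.

```python
def two_bit_accuracy(outcomes, counter=3):
    """Replay outcomes (1 = taken, 0 = not taken) through a 2-bit saturating counter."""
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == bool(taken):
            correct += 1
        # Move toward 3 on taken, toward 0 on not taken, saturating at both ends.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


loop = ([1] * 9 + [0]) * 100
print(two_bit_accuracy(loop))        # 0.9
```

Under these assumptions only the loop exit is mispredicted: one wrong prediction moves the counter from 3 to 2, which still predicts taken when the loop is entered again.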
2-bit Predictor Statistics
Prediction accuracy of 4K-entry 2-bit prediction buffer vs. “infinite” 2-bit buffer:
increasing buffer size from 4K does not significantly improve performance
n-bit Predictors
d    b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
2    NT              T           T                   NT              T           T
0    T               NT          NT                  T               NT          NT
2    NT              T           T                   NT              T           T
0    T               NT          NT                  T               NT          NT
[Table with columns: d, b1 prediction, b1 action, new b1 prediction, b2 prediction, b2 action, new b2 prediction]
Behavior of a 1-bit predictor with 1 bit of correlation, assuming the predictor is initially NT/NT and d
alternates between 0 and 2: mispredictions occur only on the first iteration.
Predictions used are shown in red.
Correlating Predictors:
(m,n) Predictors
The correlating predictor as before – 1 bit of prediction plus 1
correlating bit – is called a (1,1) predictor
Generalization of the (1,1) predictor is the (m,n) predictor
(m,n) predictor: use the behavior of the last m branches to
choose from one of 2^m branch predictors, each of which is an
n-bit predictor
The history of the most recent m branches is recorded in an
m-bit shift register called the m-bit global history register
shift in the behavior bit for the most recent branch, shift out
the bit for the least recent branch
Index into the BHT by concatenating the lower bits of the
branch instruction address with the m-bit global history to
access an n-bit entry
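The indexing scheme above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular machine: the class name `CorrelatingPredictor`, the use of `pc >> 2` for the low address bits, the all-zero initial state, and the use of saturating counters for the n-bit entries are all assumptions.

```python
class CorrelatingPredictor:
    """Sketch of an (m, n) predictor: m bits of global history, n-bit counters."""

    def __init__(self, m, n, index_bits):
        self.m, self.n, self.index_bits = m, n, index_bits
        self.history = 0                                 # m-bit global history register
        self.table = [0] * (1 << (index_bits + m))       # n-bit entries
        self.max_count = (1 << n) - 1

    def _index(self, pc):
        low = (pc >> 2) & ((1 << self.index_bits) - 1)   # assumed: word-aligned PCs
        return (low << self.m) | self.history            # concatenate address bits and history

    def predict(self, pc):
        return self.table[self._index(pc)] >= (1 << (self.n - 1))

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
        # Shift in the newest outcome, shift out the oldest.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 50
print(count_mispredictions(CorrelatingPredictor(0, 1, 10), 0x40, alternating))  # 100
print(count_mispredictions(CorrelatingPredictor(1, 1, 10), 0x40, alternating))  # 1
```

With m = 0 the sketch degenerates to a plain 1-bit predictor, which mispredicts an alternating branch every time; one bit of history is enough to separate the two cases.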
(2, 2) Correlating Branch Predictors
Example of (2, 2) Correlating Predictor
Accuracy of Correlating Predictors
Tournament Predictors
Motivation for correlating branch predictors:
2-bit local predictor failed on important
branches; by adding global information,
performance improved
Tournament predictors: use two predictors, one
based on global information and one based on
local information, and combine them with a selector
The hope is to select the right predictor for the right
branch (or the right context of a branch)
Tournament Predictor in Alpha 21264
4K 2-bit counters to choose from among a global
predictor and a local predictor
Global predictor also has 4K entries and is indexed by
the history of the last 12 branches; each entry in the
global predictor is a standard 2-bit predictor
12-bit pattern: ith bit is 0 => ith prior branch not taken;
ith bit is 1 => ith prior branch taken
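Here c1/c2 means: correctness of predictor 1 /
correctness of predictor 2
[Figure: state diagram of the 2-bit selector, with two "Use 1" states and two "Use 2" states. Edges are labeled with c1/c2 values: self-loops labeled 00,10,11 and 00,01,11 on one pair of states, self-loops labeled 00,11 on the other pair, and transitions labeled 01 and 10 between states. The selector table holds 4K entries of 2 bits and is indexed by 12 bits.]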
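The selector can be read as a 2-bit saturating counter driven by the c1/c2 outcomes. The sketch below is one plausible encoding, not the documented 21264 implementation: the state numbering (0 and 1 select predictor 1, 2 and 3 select predictor 2) and the function names are assumptions.

```python
def update_selector(counter, c1, c2):
    """c1, c2: whether predictor 1 / predictor 2 was correct for this branch."""
    if c1 and not c2:
        return max(counter - 1, 0)   # 1/0: move toward "use predictor 1"
    if c2 and not c1:
        return min(counter + 1, 3)   # 0/1: move toward "use predictor 2"
    return counter                   # 0/0 or 1/1: no information, stay put


def chosen_predictor(counter):
    return 1 if counter <= 1 else 2


state = 0                            # strongly "use predictor 1"
state = update_selector(state, False, True)
print(chosen_predictor(state))       # 1: a single 0/1 event is not enough to switch
state = update_selector(state, False, True)
print(chosen_predictor(state))       # 2: the second 0/1 event switches the choice
```

As with the 2-bit branch predictor, the selector has to see the other predictor win twice before it changes its choice from a strong state.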
Tournament Predictor in Alpha 21264
Local predictor consists of a 2-level predictor:
Top level: a local history table consisting of 1024 10-bit
entries; each 10-bit entry holds the most recent 10
branch outcomes for the entry. A 10-bit history allows patterns
of up to 10 branches to be discovered and predicted. Indexed by local
branch address.
Next level: the selected entry from the local history table is used
to index a table of 1K entries consisting of 3-bit saturating
counters, which provide the local prediction
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits!
(~180K transistors)
[Figure: local history table (1K entries of 10 bits) whose selected entry indexes a table of 1K 3-bit counters]
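The two-level local predictor described above can be sketched as follows. The table sizes come from the slides; everything else is an assumption made for illustration: the class name `LocalPredictor`, indexing the history table with `pc >> 2`, the all-zero initial state, and predicting taken when the 3-bit counter is 4 or higher.

```python
class LocalPredictor:
    def __init__(self):
        self.histories = [0] * 1024  # level 1: 10-bit local histories
        self.counters = [0] * 1024   # level 2: 3-bit saturating counters (0..7)

    def predict(self, pc):
        history = self.histories[(pc >> 2) & 1023]
        return self.counters[history] >= 4

    def update(self, pc, taken):
        i = (pc >> 2) & 1023
        history = self.histories[i]
        c = self.counters[history]
        self.counters[history] = min(c + 1, 7) if taken else max(c - 1, 0)
        # Record the newest outcome in the branch's 10-bit history.
        self.histories[i] = ((history << 1) | int(taken)) & 1023


# Loop branch: taken 9 times, then not taken once, repeated 100 times.
predictor = LocalPredictor()
outcomes = ([True] * 9 + [False]) * 100
late_misses = 0
for k, taken in enumerate(outcomes):
    if k >= 500 and predictor.predict(0x400) != taken:
        late_misses += 1
    predictor.update(0x400, taken)
print(late_misses)                   # 0
```

Once the counters have warmed up, the 10-bit history identifies the position inside the 10-iteration pattern, so even the loop exit is predicted correctly in this sketch.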
% of predictions from local predictor
in Tournament Prediction Scheme
benchmark    % from local predictor
nasa7        98%
matrix300    100%
tomcatv      94%
doduc        90%
spice        55%
fpppp        76%
gcc          72%
espresso     63%
eqntott      37%
li           69%
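Accuracy of Branch Prediction
[Figure: bar chart of prediction accuracy per benchmark for three schemes (legend: Profile-based, 2-bit counter, Tournament). Bar values per benchmark, in the order they appear in the chart:]
tomcatv     99%   99%   100%
doduc       95%   84%   97%
fpppp       86%   82%   98%
li          88%   77%   98%
espresso    86%   82%   96%
gcc         88%   70%   94%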
[Figure: line chart with a vertical axis from 0% to 9% and a horizontal axis from 0 to 128; series: Correlating (2,2) scheme, Tournament]
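[Figure: during FETCH, the PC of the instruction is compared (=?) against the stored branch PCs; each entry holds a branch PC, a predicted PC, and prediction state bits. Yes: instruction is a branch; use the predicted PC as the next PC (if predicted taken). No: branch not predicted; proceed normally (PC+4).]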
Branch Target “Cache”
Branch target cache: holds only predicted-taken branches
“Cache”: a Content Addressable Memory (CAM) or Associative
Memory (see figure)
Use a big Branch History Table and a small Branch Target Cache
[Figure: the PC is compared (=?) against the stored branch PCs; each entry holds a branch PC, a predicted PC, and optional prediction state bits. Yes: branch found, predicted taken. No: not found.]
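The lookup described above can be sketched with a dictionary standing in for the associative memory. This is only an illustration: the class name `BranchTargetCache`, the capacity of 64 entries, the FIFO eviction, and the policy of dropping an entry when its branch is not taken are all assumptions, not details taken from the slides.

```python
class BranchTargetCache:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = {}            # branch PC -> predicted target PC

    def next_pc(self, pc):
        # Hit: predicted-taken branch, fetch from the stored target.
        # Miss: not a known taken branch, proceed normally with PC + 4.
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        if taken:
            if pc not in self.entries and len(self.entries) >= self.capacity:
                oldest = next(iter(self.entries))   # dicts keep insertion order
                del self.entries[oldest]
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)   # keep only predicted-taken branches


btc = BranchTargetCache()
print(hex(btc.next_pc(0x100)))       # 0x104
btc.update(0x100, True, 0x80)
print(hex(btc.next_pc(0x100)))       # 0x80
```

Keeping only taken branches is what lets the target cache stay small while the separate, larger branch history table supplies the taken/not-taken prediction.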