Reducing Capacity and Conflict Misses
using Set Saturation Levels
Dyer Rolán, Basilio B. Fraguela and Ramón Doallo
Grupo de Arquitectura de Computadores
Departamento de Electrónica e Sistemas
Universidade da Coruña
e-mail: {drolan, basilio.fraguela, ramon.doallo}@udc.es
Abstract—The well-known memory wall problem has motivated extensive research in cache design. Last-level caches, whose misses can stall the processor for hundreds of cycles, have received particular attention. Strategies to adaptively modify the cache insertion, promotion, eviction and even placement policies have been proposed, with different techniques being better at reducing different kinds of misses. For example, changes in the placement policy of a cache, which are a natural option to reduce conflict misses, can do little to fight capacity misses, which depend on the relation between the working set of the application and the cache size. Conversely, other techniques, such as the recently proposed dynamic insertion policy (DIP), whose aim is to retain a fraction of the working set in the cache when it is larger than the cache size, attack primarily capacity misses. In this paper we present a coordinated strategy to reduce both capacity and conflict misses by changing the placement and insertion policies of the cache. Our strategy takes its decisions based on the concept of the Set Saturation Level (SSL), which tries to measure the degree to which a set can hold its working set. Despite requiring less than 1% storage overhead, our proposal, called the Bimodal Set Balancing Cache, reduced the average miss rate of a baseline 2MB 8-way second-level cache by 16%, which translated into an average IPC improvement of 4.8% in our experiments.
Index Terms—Cache; performance; adaptivity; balancing; insertion; replacement; thrashing; set saturation level
I. INTRODUCTION
Memory hierarchy plays a key role in the performance
of current computers given the memory wall problem. This
has led to the current state of the art with several levels
of caches, the ones nearest to the processor being primarily
optimized for latency, and the last-level caches (LLC) being
responsible for avoiding off-chip memory accesses, which
can stall the processor for hundreds of cycles. The greater flexibility that LLCs allow with respect to their response times and the enormous importance of reducing their miss rate have led the community to propose a large number of techniques to improve their adaptability to the behavior of applications. For example, the problem of the lack of uniformity in the distribution of the memory references among the sets of set-associative caches, which is a fundamental source of conflict misses, has been addressed by victim caches [1], the adaptation of the assignment of lines to sets [2], or the displacement of lines from oversubscribed sets to underutilized ones [3]. Other proposals seem more adequate to reduce capacity misses. A very good example is [4], which targets memory-intensive workloads with working sets that do not fit in the cache, for which the traditional LRU replacement policy is counterproductive.
In this paper we investigate the possibility of combining a
technique that targets primarily conflict misses with another
one particularly suitable to reduce capacity misses. Our proposal is based on the set saturation level, a concept proposed
in [3] to measure the degree to which a set is able to hold
its working set. This indicator was used to decide when a set requires more lines than those available, or when its lines are underutilized. The Set Balancing Cache (SBC) introduced there associates sets with high saturation levels with sets with low saturation levels, allowing displacements of lines from the former to the latter. Since the SBC targets the placement of lines in the cache, it is mainly effective at reducing conflict misses, but it can do little to improve performance when the working set of a workload is larger than the cache size.
The core idea of our proposal is to use the set saturation levels not only as indicators of imbalance among the cache sets, but also as detectors of a lack of capacity of the cache to hold the working set. In order to solve the first problem, our design, which we call Bimodal Set Balancing Cache (BSBC), applies the Dynamic SBC (DSBC) introduced in [3]. If the lack of lines to hold the working sets persists after the displacement of lines from oversubscribed sets to underutilized ones, the BSBC applies a policy to address capacity problems. Namely, the insertion policy of highly saturated sets is changed to the Bimodal Insertion Policy (BIP) [4], which usually inserts lines in the LRU position instead of the MRU one. This prevents lines that are dead on arrival from expelling other lines from the cache as they descend the LRU stack. Another possible interpretation of our strategy is that both the DSBC and BIP target the same problem, namely that a set may have more active blocks than ways: the first solution tries to move the blocks to an underutilized or cold set, while the other tries to evict cold blocks within the set. Our approach then switches to the better of the two solutions depending on the availability of underutilized sets.
Our experiments show that our coordinated approach to fight conflict and capacity misses works substantially better than simply applying the DSBC and the Dynamic Insertion Policy (DIP) [4] simultaneously in a cache. The latter is an insertion policy that chooses dynamically between the traditional MRU insertion policy and BIP based on a set-dueling mechanism that tries to identify the one that incurs fewer misses.
The rest of this paper is organized as follows. The next
section introduces the basics of the DSBC cache and the
BIP and DIP insertion policies, discusses the limitations of those approaches, and explains why and in which way they can be complementary. This leads to the description of the
Bimodal Set Balancing Cache we propose in Section III, which
is evaluated using the environment described in Section IV.
The results are discussed in Section V. The cost of this
approach is examined in Section VI. Related work is discussed
in Section VII. Finally, the last section is devoted to the
conclusions and future work.
II. TWO COMPLEMENTARY CACHE MANAGEMENT POLICIES

As the preceding section states, the bibliography offers proposals that are more suitable to reduce conflict misses due to the oversubscription of specific sets of the cache, and techniques that try to address a global problem of capacity of the cache with respect to the working set in use. We have chosen as representative techniques of both families of proposals the Dynamic Set Balancing Cache [3] and the novel insertion policies proposed in [4]. Both approaches will be discussed in turn, followed by a constructive critique of their limitations and their complementarity.

A. Dynamic Set Balancing Cache

The basic idea of the Dynamic Set Balancing Cache, DSBC in what follows, is to alleviate the problems of oversubscribed cache sets by moving part of the lines originally mapped to them to other sets that have underutilized lines. This requires detecting the degree to which each cache set is able to hold its working set. The DSBC achieves this with a metric called Set Saturation Level (SSL), which is tracked separately for each set by means of a counter called the saturation counter. This counter, which has saturating arithmetic, is increased each time an access to the set results in a miss, and decreased when it results in a hit. A counter in the range 0 to 2K-1 for K-way associative caches is proposed in [3].

A cache set is deemed highly saturated, and thus unable to hold its working set, when its SSL, which is the value of its saturation counter, reaches the maximum (2K-1). At that point, the DSBC tries to associate this set with another set with a low SSL that will be used to store part of the working set of the saturated set. The DSBC chooses for this purpose the set with the lowest SSL in the cache that is not yet associated to another set, provided that its SSL is smaller than K. The rationale for this latter limitation is that it does not seem useful to associate two sets with high SSLs in order to lend lines from one of them to the other.

Once an association is established, the less saturated set becomes the destination set for displacements of lines from the highly saturated set, which becomes the source set of the association. The displacements take place when a line is evicted from the source set and its SSL has the maximum value. In this situation the evicted line is moved to the destination set of the association rather than to memory. Displaced lines are marked with a bit that allows them to be distinguished from the native lines of the destination set. Insertions of lines in sets always take place in the MRU position of the recency stack and the traditional LRU replacement policy is applied. Another consequence of the association is that, while it lasts, accesses that result in misses in the source set of the association make a second attempt to find the requested data in the destination set. This gives rise to secondary hits and misses. An association is broken when the destination set evicts all the lines it has received from the source set.

B. Adaptive Insertion Policies

Most caches nowadays use an LRU replacement policy in which lines are inserted in the MRU position of the recency stack. The lines must then descend this stack position by position until reaching the LRU position before being evicted. Although this policy works very well for many workloads, in [4] it was observed that it often leads memory-intensive workloads with working sets larger than the cache size to thrashing. As a result, they proposed an LRU Insertion Policy (LIP) which always inserts lines in the LRU position of the recency stack. If the inserted line is reused, it is moved to the MRU position, as in any cache. If the line is not reused before the next miss in the set, the next line inserted replaces it. At this point it is very important to remember that [4], just like [3] and this paper, deals with non-first-level caches. LIP exploits the fact that in these caches many lines are dead on arrival (i.e. they are not reused in the cache before their eviction) because all their potential short-term temporal or spatial locality is exploited in upper-level caches.

While LIP works well for some workloads, it may tend to retain in the non-LRU positions of the recency stack lines which are actually not useful, that is, which do not belong to the current working set. Thus a Bimodal Insertion Policy (BIP) that tries to adapt the contents of the set to the active working set of the application was also proposed in [4]. BIP achieves this by inserting the incoming lines in the MRU position of the recency stack with a low probability ε, operating like LIP all the other times.

There is no absolute winner between BIP and the traditional MRU insertion policy, BIP being better suited for some applications and the traditional policy for others. Thus [4] proposed the Dynamic Insertion Policy (DIP), which uses set dueling to track the behavior of both insertion policies in order to apply the best one to the remaining cache sets, called follower sets. Set dueling requires dedicating a fraction of the cache sets to operate always under BIP and another fraction to apply always MRU insertion. A group of 32 sets dedicated to each policy is found to be good in [4]. The misses in one of the groups of sets increase the value of a global saturating counter while the misses in the other group decrease it. The most significant bit of the counter then indicates which is the best policy at each moment. All the follower sets apply the policy indicated by the counter.
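To make the stack mechanics of these insertion policies concrete, the following toy Python model (our illustration; the class and its names are not from [4]) simulates one K-way set's recency stack under traditional MRU insertion, LIP and BIP:

```python
import random

class CacheSet:
    """Toy model of one K-way set's recency stack: index 0 is the MRU
    position, index ways-1 the LRU position."""

    def __init__(self, ways, policy="mru", epsilon=1 / 32, seed=0):
        self.ways = ways
        self.policy = policy           # "mru" (traditional), "lip" or "bip"
        self.epsilon = epsilon         # BIP: probability of an MRU insertion
        self.stack = []                # line tags ordered MRU -> LRU
        self.rng = random.Random(seed)

    def access(self, tag):
        """Returns True on a hit, False on a miss."""
        if tag in self.stack:          # hit: promote to MRU, as in any cache
            self.stack.remove(tag)
            self.stack.insert(0, tag)
            return True
        if len(self.stack) == self.ways:
            self.stack.pop()           # miss in a full set: evict the LRU line
        if self.policy == "mru" or (
            self.policy == "bip" and self.rng.random() < self.epsilon
        ):
            self.stack.insert(0, tag)  # insert in the MRU position
        else:
            self.stack.append(tag)     # LIP, and the common BIP case: LRU insert
        return False
```

Running a cyclic pattern of K+1 distinct lines over a K-way set reproduces the effect [4] describes: MRU insertion thrashes and yields no hits, while LIP and BIP retain most of the working set.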
Figure 1. Percentage of IPC improvement over a 2MB 8-way, 64-byte line size baseline cache using DIP and Local DIP.

Figure 2. Miss rate reduction relative to a 2MB 8-way, 64-byte line size baseline cache using DIP and Local DIP.
C. Discussion

The DSBC relies on the imbalance between the working sets of different cache sets to trigger the mechanisms that make it different from a standard cache. As a result it is oriented to reducing conflict misses rather than capacity misses. Another consequence of this approach is that its metrics must be defined at the cache set level in order to find these imbalances, which is indeed the case for the SSL.

The adaptive insertion policies proposed in [4] alleviate the lack of capacity of the cache to hold the data set manipulated by an application. Since this is a global problem for the whole cache, these policies are applied to all the sets at once. For the same reason, DIP, the one that can adapt dynamically to the characteristics of the workloads, relies on a global metric gathered from the behavior of sets spread along the cache.
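This global metric can be sketched as follows. The Python model below is our illustration, assuming the 10-bit saturating selection counter mentioned in Section VI; it is not the exact hardware of [4]:

```python
class SetDuelingMonitor:
    """Sketch of DIP's set-dueling selector: misses in the sets dedicated to
    MRU insertion increment a global saturating counter, misses in the sets
    dedicated to BIP decrement it, and the counter's most significant bit
    selects the policy applied by all the follower sets."""

    def __init__(self, bits=10):          # 10-bit counter, per Section VI
        self.maximum = (1 << bits) - 1
        self.psel = self.maximum // 2     # start near the midpoint

    def miss(self, in_mru_group):
        if in_mru_group:                  # a miss in an MRU-dedicated set
            self.psel = min(self.psel + 1, self.maximum)
        else:                             # a miss in a BIP-dedicated set
            self.psel = max(self.psel - 1, 0)

    def followers_use_bip(self):
        # MSB set means the MRU group is accumulating more misses,
        # so the follower sets should switch to BIP.
        return self.psel >= (self.maximum + 1) // 2
```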
This way, it seems straightforward that the DSBC and DIP should be complementary and that implementing them simultaneously in a cache will offer a higher level of protection against misses than using only one of them. This is also very feasible given their reduced hardware overhead. Nevertheless, the simultaneous implementation of DSBC and DIP, which we call DSBC+DIP, yields results very similar to, or even worse than, the ones achieved with either of them independently, as we will see in Section V. The main problems happen when DIP chooses BIP for the followers. The DSBC displaces the LRU line of the source sets, as this seems the natural option. This means that the line that BIP exposes for eviction on the next miss in the source set of an association is actually saved by the DSBC, which moves it to the destination set. Since these lines are not actually useful, their presence in the destination set gives rise to unsuccessful second searches, which delay the resolution of the miss by the time taken to make a new access to the tag array. Even worse, they may expel an actually useful line just inserted in the LRU position of the destination set before it gets a chance to be reused.
When DIP chooses the traditional insertion policy for the followers, a DSBC+DIP cache behaves very much like a DSBC with two penalties. The first one, which is inherent to DIP, is the existence of sets (32 in our implementation, as advised in [4]) that are forced to apply BIP for the sake of the set dueling even when it is not performing well. The other is that when these sets are involved in an association they generate the problems discussed above.
As a result, while these policies seem complementary, they
require a coordinated approach to work properly together. Our
proposal is presented in the next section.
III. BIMODAL SET BALANCING CACHE
A first issue we explore in the attempt to jointly exploit the DSBC and the adaptive insertion policies is the possibility of using the same metric to control them. This will ease their coordination, and it can even simplify the hardware with respect to that required to implement them separately, thanks to the reuse of the hardware that computes the metric. A per-set metric like the one that the DSBC requires cannot be obtained from the global counter used by DIP. Thus we checked whether the decision that DIP takes based on set dueling can instead be made based on the SSL provided by the DSBC. This would not only simplify the design of the cache, but also avoid always having a fraction of the cache sets working with a wrong policy, even if this fraction is small. A way to achieve this is to use the SSL of each set to decide whether the traditional insertion policy or BIP is better suited for that specific set. Our proposal is to change the insertion policy of a set to BIP only if it gets saturated (SSL = 2K-1 for a saturation counter in the range 0 to 2K-1), and revert to MRU insertion when it clearly reduces its SSL. An SSL below K has been chosen to trigger the change to MRU insertion. This proposal, which we call Local DIP, only involves a saturation counter and one additional bit per set, called the insertion policy bit, that indicates the insertion policy of the set. Local DIP is compared with DIP in terms of IPC improvement and miss rate reduction over a baseline 8-way second-level cache of 2MB with 64-byte lines in Figures 1 and 2, respectively. BIP uses ε = 1/32 as the probability that a new line is inserted in the MRU position of the recency stack instead of the LRU one in both implementations. The simulation environment and the benchmarks are explained in Section IV. The results are similar, Local DIP being slightly better than DIP on average. Thus we dropped set dueling in favor of a local per-set decision based on the SSL.
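A minimal sketch of this per-set Local DIP decision, in illustrative Python (the class and field names are ours, not a hardware specification):

```python
class LocalDIPSet:
    """Per-set Local DIP controller: the saturating SSL counter is updated on
    every access, and the insertion policy bit switches to BIP on saturation
    (SSL = 2K-1) and back to MRU insertion when the SSL drops below K."""

    def __init__(self, k):
        self.k = k
        self.ssl_max = 2 * k - 1   # counter range [0, 2K-1], as in [3]
        self.ssl = 0
        self.use_bip = False       # the per-set insertion policy bit

    def record_access(self, hit):
        # Saturating arithmetic: hits decrease the SSL, misses increase it.
        if hit:
            self.ssl = max(self.ssl - 1, 0)
        else:
            self.ssl = min(self.ssl + 1, self.ssl_max)
        if self.ssl == self.ssl_max:
            self.use_bip = True    # the set is saturated: switch to BIP
        elif self.ssl < self.k:
            self.use_bip = False   # SSL clearly reduced: back to MRU insertion
```

Note the hysteresis: a set that saturates stays in BIP mode while its SSL remains between K and 2K-1, and only reverts once the SSL falls below K.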
Let us now consider the nature of the SSL. A high SSL indicates that the set cannot hold its working set, but it is difficult to know from this value alone whether this is a problem specific to the set, which means other sets have no problems with their working sets, or a global problem of capacity of the cache. The answer lies in the comparison with the SSLs of the other cache sets. If the cache has enough capacity to hold its working set, the DSBC mechanism should be able to find suitable sets to be associated to the problematic one, allowing it to displace part of the lines of its working set to a destination set with underutilized lines. If the DSBC cannot associate the set, it is because there are no sets with an SSL low enough to deem them good candidates to receive lines from other sets. This then points to a potential problem of capacity of the cache, which can be dealt with by adopting BIP. Since the DSBC only seeks to initiate associations when a set is saturated, a good strategy is to first try to associate the set to a destination set, and if no good candidate is found, change the set insertion policy to BIP. Altogether, this strategy equates to first trying to treat the high SSL in the set as a local problem, that is, conflict misses due to the oversubscription of this specific set, and if this fails, to consider that there may be a global problem of capacity which requires turning to BIP. If the cache has a capacity problem, it is very likely that sets that were previously chosen as destinations of an association become saturated too. Thus we propose that destination sets that become saturated change their insertion policy to BIP. Relatedly, it is logical that if the source set of an association gets saturated and its destination set is applying BIP, the source set changes to BIP too. This acknowledges that a capacity problem rather than a local conflict problem is being faced. Finally, just as in our Local DIP, if the SSL of a set in BIP mode drops below K, its insertion policy changes to MRU insertion, since the capacity problems seem to have disappeared.
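The decision flow described above can be summarized with the following illustrative Python sketch; the data structures and helper names are our simplification of the hardware (the real design selects candidates through the DSS described in Section IV), not the paper's implementation:

```python
K = 8  # associativity assumed for this sketch

class CSet:
    """Minimal per-set state for the sketch."""
    def __init__(self):
        self.ssl = 0
        self.use_bip = False        # insertion policy bit
        self.partner = None         # associated set, if any
        self.is_destination = False

def on_saturation(sets, s):
    """Reaction of the BSBC when set s reaches SSL = 2K-1."""
    if s.is_destination:
        s.use_bip = True            # a saturated destination: capacity problem
    elif s.partner is not None and s.partner.use_bip:
        s.use_bip = True            # source whose destination is already in BIP
    elif s.partner is None:
        # First treat it as a local conflict problem: pick the set with the
        # lowest SSL, provided that SSL < K (the DSBC association rule).
        candidates = [d for d in sets
                      if d is not s and d.partner is None and d.ssl < K]
        if candidates:
            dest = min(candidates, key=lambda d: d.ssl)
            s.partner, dest.partner = dest, s
            dest.is_destination = True
        else:
            s.use_bip = True        # no candidate: global capacity problem

def on_ssl_below_k(s):
    """When a set's SSL drops below K it reverts to MRU insertion (a
    destination set would also re-enable displacements from its source)."""
    s.use_bip = False
```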
The eviction of recently inserted lines in destination sets that operate under BIP by lines displaced from their source set was identified as one of the problems of the DSBC+DIP approach in Section II-C. BIP puts useful lines in destination sets in a dangerous situation because, since they are inserted in the LRU position, any displacement before their reuse evicts them from the cache. Our design avoids this by enforcing that destination sets in BIP mode are in read-only mode. This means that misses in their source set will lead to searches in them, but no displacements of lines from the source set will be allowed. This is consistent with our view that BIP is triggered to fight capacity problems rather than conflicts. Another positive side effect of this policy is that since no displacements are allowed in BIP mode, it is easier to break the association, which is in fact not helpful when there is a capacity problem. Let us recall
Table I
ARCHITECTURE. IN THE TABLE RT, TC AND MSHR STAND FOR ROUND TRIP, TAG DIRECTORY CHECK AND MISS STATUS HOLDING REGISTERS, RESPECTIVELY.

Processor
  Frequency:                    4GHz
  Fetch/Issue:                  6/4
  Inst. window size:            80 int+mem, 40 FP
  ROB entries:                  152
  Integer/FP registers:         104/80
  Integer FU:                   3 ALU, Mult. and Div.
  FP FU:                        2 ALU, Mult. and Div.

Common memory subsystem
  L1 i-cache & d-cache:         32kB/8-way/64B/LRU
  L1 Cache ports:               2 i / 2 d
  L1 Cache latency (cycles):    4 RT
  L1 MSHRs:                     4 i / 32 d
  System bus bandwidth:         10GB/s
  Memory latency:               125ns

Two-level specific memory subsystem
  L2 (unified, inclusive) cache:  2MB/8-way/64B/LRU
  L2 Cache ports:                 1
  L2 Cache latency (cycles):      14 RT, 6 TC
  L2 MSHRs:                       32

Three-level specific memory subsystem
  L2 (unified, inclusive) cache:  256kB/8-way/64B/LRU
  L3 (unified, inclusive) cache:  2MB/16-way/64B/LRU
  Cache ports:                    1 L2, 1 L3
  Cache latency (cycles):         11 RT L2, 4 TC L2, 39 RT L3, 11 TC L3
  MSHRs:                          32 L2, 32 L3
that the association is broken when during its operation the
destination set evicts all the lines it received from the source
set. Finally, when the SSL of a destination set goes below K,
besides reverting to MRU insertion, it also enables again the
displacement of lines from its source set.
Altogether, our proposal, which we call Bimodal Set Balancing Cache (BSBC) because it is a Set Balancing Cache with
an integrated BIP, has almost the same hardware overhead as
a DSBC. Only one additional bit is required per set in order
to store its current insertion policy. As for the time required
to apply the BSBC algorithms, just as in the DSBC and DIP,
they are triggered by misses and can be thus overlapped with
their resolution. The contention in the tag array due to second
searches has been considered in our evaluation.
IV. SIMULATION ENVIRONMENT
Our evaluation is based on simulations performed using the
SESC environment [5]. The baseline system has a four-issue
out-of-order CPU clocked at 4GHz. Two memory hierarchies,
one with two levels of cache, and another one with three levels,
have been studied. When a modification of the design of the
caches of the baseline system is evaluated, it is applied in the
L2 cache of the hierarchy with two levels of cache, and in the
L2 and L3 caches of the hierarchy with three levels. Table I
displays the simulation parameters, whose timings assume a
45nm technology process.
The three-level hierarchy is inspired by the Core i7 [6], the L3 being proportionally smaller to account for the fact that only one core is used in our experiments. Both configurations allow an aggressive parallelization of misses, as there are
Table II
BENCHMARKS CHARACTERIZATION.

Benchmark    # L2 Accesses   2MB L2 Miss rate   256kB L2 Miss rate   Component
bzip2        125M            9%                 41%                  INT
milc         255M            71%                75%                  FP
gobmk        77M             5%                 10%                  INT
soplex       105M            8%                 15%                  FP
hmmer        55M             10%                41%                  INT
sjeng        32M             26%                27%                  INT
libquantum   156M            74%                74%                  INT
omnetpp      100M            28%                91%                  INT
astar        192M            23%                48%                  INT
sphinx3      122M            68%                76%                  FP
Figure 3. Primary miss, secondary miss, primary hit and secondary hit rates for the baseline system, a DIP variation, a DSBC variation and a BSBC one for (a) L2 cache in the two-level configuration, (b) L2 cache in the three-level configuration, and (c) L3 cache in the three-level configuration.
between 16 and 32 Miss Status Holding Registers per cache.
The tag check delay and the total round trip access are
provided for the L2 and L3 to help evaluate the cost of second
searches when there are associations. As in several existing processors [7][8] and works in the bibliography [9][10][2][3], accesses to non-first-level caches check the tag and the data arrays sequentially. This reduces the power dissipation of large cache arrays and limits the additional delay of second searches to the tag check delay.
As for the parameters that are specific to the different
approaches used in this study, DIP uses 32 sets dedicated
to each policy to decide between BIP and MRU insertion.
This BIP, as well as the one triggered by the BSBC, uses a
probability ε = 1/32 that a new line is inserted in the MRU
position of the recency stack. The DSBC and the BSBC use
a DSS, or Destination Set Selector, of four entries like the
one used in [3]. This is the structure that stores the potential sets for an association in order to quickly provide a candidate when one is requested.
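A functional sketch of such a DSS, in illustrative Python; each entry holds a set index and its SSL, but the update policy shown is our simplification, not the exact hardware of [3]:

```python
class DSS:
    """Sketch of a Destination Set Selector: a small table (four entries here,
    as in [3]) that tracks the least saturated unassociated sets so that a
    destination candidate can be supplied immediately on a saturation event."""

    def __init__(self, entries=4):
        self.entries = entries
        self.table = []             # (set_index, ssl) pairs, sorted by SSL

    def observe(self, set_index, ssl):
        # Keep only the sets with the lowest SSLs seen so far.
        self.table = [(i, v) for i, v in self.table if i != set_index]
        self.table.append((set_index, ssl))
        self.table.sort(key=lambda e: e[1])
        del self.table[self.entries:]

    def candidate(self, k):
        # Hand out the best candidate whose SSL < K, or None if there is none.
        if self.table and self.table[0][1] < k:
            return self.table.pop(0)[0]
        return None
```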
A. Benchmarks
Ten benchmarks from the SPEC CPU 2006 suite have been used to evaluate our approach. They have been simulated using the reference input set (ref), for 10 billion instructions after the initialization. Table II characterizes them with the number of L2 accesses during the simulation, the miss rate in the L2 cache both in the two-level (2MB L2) and the three-level (256kB L2) configurations, and whether they belong to the INT or FP set of the suite. As we can see, the benchmarks in this mix vary widely both in the number of accesses that reach the caches below the first level and in the miss ratios.
V. RESULTS AND COMPARISON WITH OTHER APPROACHES

The behavior of DIP, the DSBC and the BSBC is compared in the second-level cache of the hierarchy with two levels of cache and in the two lower levels of the hierarchy with three levels in Figure 3. It shows the rate of accesses that result in misses after a single access to the tag array (primary misses) or two (secondary misses), and accesses that hit in the cache in the first check of the tag array (primary hits) or after the second one (secondary hits). Only the DSBC and the BSBC present secondary accesses, which take place when an access misses in the source set of an association. The last group of columns (mean) represents the arithmetic mean of the rates observed in each cache. For example, in the L2 of the two-level configuration the BSBC gets an average miss rate (considering both kinds of misses) of 27%, compared to the 32% of the baseline configuration. This is a relative reduction in the miss rate of 15.7%. In this cache DIP achieves a miss rate reduction of 10% and the DSBC of 11.5%. So we can see that our design allows the two policies to work in a coordinated way, getting the best of each one of them. The ratio of all the cache accesses that result in secondary misses is 2% and 3% for the BSBC and the DSBC, respectively. The additional delay of a secondary miss with respect to a primary miss is small, but it is still good that the BSBC not only generates more hits than the DSBC but also reduces the number of secondary misses by 1/3. The reduction is not surprising if we realize that in applications with capacity problems, the saturation of the destination sets will avoid displacements of lines that are actually not useful. This will also enable these sets to break the association earlier. Let us remember that an association is broken when the destination set evicts all the lines received from the source set. Altogether this leads to fewer unsuccessful secondary searches than in the DSBC. The DSBC and the BSBC present the same rate of accesses that result in secondary hits, about 4%.

Looking at individual benchmarks we can appreciate how the BSBC adapts to the different types of applications, often performing better than both DIP and the DSBC. For example, in 433.milc, 462.libquantum and 482.sphinx3, which are more suited to DIP than to the DSBC, the BSBC achieves results similar to DIP (somewhat worse in 482.sphinx3), and better than the DSBC.

The BSBC is also able to adapt to those applications that benefit more from the DSBC because of imbalances in the working set sizes of different cache sets. This happens, for example, in 401.bzip2, 471.omnetpp and 473.astar, where the BSBC and the DSBC work better than DIP. Therefore, the BSBC works largely as DIP for streaming applications using BIP, and mostly as the DSBC when the application presents imbalances among the working sets of the sets. It is often the case that the BSBC even improves over both approaches by combining both behaviors.

The effects just discussed can be seen more clearly in Figure 4, which shows the relative miss rate reduction with respect to the baseline L2 cache of the configuration with two levels of cache for five cache designs. The approaches compared are DIP, the DSBC, their simultaneous implementation DIP+DSBC, pseudoLIFO and the BSBC. PseudoLIFO stands for the probabilistic escape LIFO, which has been simulated approximating the hit counts by the next power of two for the escape probabilities. This is the best policy among the family of pseudoLIFO replacement policies proposed in [11] according to its evaluation. These policies will be reviewed briefly in Section VII.

Figure 4. Miss rate reduction over the baseline in the two-level configuration using DIP, DSBC, their simultaneous implementation DIP+DSBC, pseudoLIFO and the BSBC.

The last group of columns in Figure 4 is the geometric mean of the miss rate reduction for each approach. Altogether, the BSBC achieves a relative miss rate reduction of 16%, compared to the 12% of the DSBC, the 11% of pseudoLIFO and the 10% of DIP. The worst performer is DIP+DSBC, which only reduces the miss rate by 8.3% despite its increased complexity with respect to the DSBC or DIP taken separately. So it is very interesting that, without the contributions of this paper, the combination of DIP and the DSBC achieves the worst performance, while following the approach we propose, their coordinated application achieves the best result by far.
Figures 5 and 6 show the performance improvement in terms of instructions per cycle (IPC) for each benchmark with respect to the baseline system in the two-level and the three-level memory hierarchies tested, respectively. The cache designs considered are the same as in Figure 4. In the two-level configuration the BSBC always has a positive or, at worst, in the case of the 445.gobmk benchmark, a negligible negative effect on performance, smaller than 1%. The geometric mean of the relative IPC improvement for the BSBC with respect to the baseline configuration is 4.8% in the two-level configuration. DIP, the DSBC, DIP+DSBC and pseudoLIFO achieve 3.2%, 3.6%, 3% and 3.4%, respectively. The analysis based on IPC points again to the importance of the contributions of this paper. A coordinated effort of DIP and the DSBC guided by the SSL allows going from the worst results with a straightforward DIP+DSBC to the best performance with the BSBC. The situation is very similar in the configuration with three levels of cache. Here the BSBC improves the IPC by 6% on average with respect
Figure 5. Percentage of IPC improvement over the baseline in the two-level configuration using DIP, the DSBC, a combination of both previous approaches, pseudoLIFO or the BSBC.

Figure 6. Percentage of IPC improvement over the baseline in the three-level configuration using DIP, the DSBC, a combination of both previous approaches, pseudoLIFO or the BSBC.
Table III
BASELINE, DSBC AND BSBC STORAGE COST IN A 2MB/8-WAY/64B/LRU CACHE.

                                   Baseline      DSBC                       BSBC
Tag-store entry:
  State (v+dirty+LRU+[d])          5 bits        6 bits                     6 bits
  Tag (42 - log2 sets - log2 64)   24 bits       24 bits                    24 bits
  Size of tag-store entry          29 bits       30 bits                    30 bits
Data-store entry:
  Set size                         64*8*8 bits   64*8*8 bits                64*8*8 bits
Additional structs per set:
  Saturation Counters              -             4 bits                     4 bits
  Insertion policy bit             -             -                          1 bit
  Association Table                -             12+1 bits                  12+1 bits
  Total of structs per set         -             17 bits                    18 bits
DSS (entries+registers)            -             4*(1+12+4)+2*(2+4) bits    4*(1+12+4)+2*(2+4) bits
Number of tag-store entries        32768         32768                      32768
Number of data-store entries       32768         32768                      32768
Number of Sets                     4096          4096                       4096
Size of the tag-store              118.7kB       122.8kB                    122.8kB
Size of the data-store             2MB           2MB                        2MB
Size of additional structs         -             8714B                      9226B
Total                              2215kB        2228kB (< 1%)              2229kB (< 1%)
to the baseline system, while DIP, DSBC, DIP+DSBC and
pseudoLIFO achieve increases of 3.8%, 5%, 4% and 3.5%,
respectively.
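The averages quoted above and in the rest of this section are geometric means of the per-benchmark IPC ratios. Purely as an illustration of that computation, with invented IPC values rather than the paper's data:

```python
# Hypothetical illustration of summarizing per-benchmark speedups with a
# geometric mean of IPC ratios. The IPC values below are invented, not the
# measured data behind Figures 5 and 6.
from math import prod

baseline_ipc = [1.20, 0.85, 1.50]   # invented per-benchmark baseline IPCs
improved_ipc = [1.26, 0.88, 1.59]   # invented IPCs with the improved cache

ratios = [n / b for n, b in zip(improved_ipc, baseline_ipc)]
geomean_ratio = prod(ratios) ** (1 / len(ratios))
print(f"{(geomean_ratio - 1) * 100:.1f}% average IPC improvement")
```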
We can observe that pseudoLIFO outperformed DIP and
DIP+DSBC in the two-level configuration. Nevertheless, the
opposite happened in the three-level configuration, and by
somewhat larger margins. This, together with the negligible
hardware cost and the simple algorithm of DIP, led us to
choose DIP as the preferred approach to deal with capacity
misses in the BSBC design.
VI. COST
We consider here the costs of the BSBC in terms of storage
and area. The latter has been modeled with CACTI 6.5 [12]
assuming a 45 nm technology process. The costs of the DSBC
cache are also computed for comparison purposes, the cost of
DIP being negligible (just 10 bits for the global counter).
The BSBC, like the DSBC, requires the following additional
hardware with respect to a standard cache: a saturation counter
per set to compute the SSL, an additional bit per entry in
the tag-array to identify displaced lines (d bit), an Association
Table with one entry per set that stores a bit to specify whether
the set is the source or the destination of the association, and
the index of the set it is associated to, and finally a Destination
Set Selector (DSS) to choose the best set for an association. A
4-entry DSS has been used in our evaluation. The BSBC also
needs one bit per set to indicate the set insertion policy. Based
on this, Table III calculates the storage required for a baseline
8-way 2 MB cache with lines of 64B assuming addresses of 42
bits. The corresponding area overhead calculated by CACTI
is shown in Table IV.
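As a sanity check, the storage figures in Table III follow directly from these parameters. The short script below reproduces them, under the assumption (which matches the reported values) that the table uses decimal kilobytes and truncates rather than rounds fractional figures:

```python
# Reproduces the storage arithmetic of Table III for a 2MB/8-way/64B cache
# with 42-bit addresses. Assumes decimal kB (1000 bytes) and truncation to
# one decimal place, which is what matches the figures in the table.
from math import log2

SETS, WAYS, LINE = 4096, 8, 64                      # 4096 sets * 8 ways * 64 B = 2 MB
ENTRIES = SETS * WAYS                               # 32768 tag-store/data-store entries
TAG_BITS = 42 - int(log2(SETS)) - int(log2(LINE))   # 42 - 12 - 6 = 24 tag bits
DSS_BITS = 4 * (1 + 12 + 4) + 2 * (2 + 4)           # 80 bits for the 4-entry DSS

def tag_store_bytes(state_bits):
    """Size of the tag store given the per-entry state bits."""
    return ENTRIES * (state_bits + TAG_BITS) // 8

base_tag = tag_store_bytes(5)                  # baseline: v + dirty + LRU
sbc_tag = tag_store_bytes(6)                   # DSBC/BSBC: one extra d bit
dsbc_extra = SETS * 17 // 8 + DSS_BITS // 8    # 17 bits/set + DSS
bsbc_extra = SETS * 18 // 8 + DSS_BITS // 8    # + 1 insertion bit per set
data_store = 2 * 1024 * 1024                   # 2 MB data store

print(int(base_tag / 100) / 10)                # 118.7 (kB, baseline tag store)
print(int(sbc_tag / 100) / 10)                 # 122.8 (kB, DSBC/BSBC tag store)
print(dsbc_extra, bsbc_extra)                  # 8714 9226 (bytes of extra structs)
print((data_store + sbc_tag + bsbc_extra) // 1000)   # 2229 (kB, BSBC total)
```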
Table IV
BASELINE, DSBC AND BSBC AREA.

Config        Components                  Details                                                            Area
Baseline      Data + Tag                  2MB 8-way 64B linesize + tag-store                                 4.3 mm2
              Total                                                                                          4.3 mm2
Dynamic SBC   Data + Tag                  2MB 8-way 64B linesize + 1 additional displaced bit in tag-store   4.31 mm2
              Counters                    4096*4 bits                                                        < 0.01 mm2
              Association Table           4096*12 bits                                                       < 0.01 mm2
              DSS (entries + registers)   4*(1+12+4)+2*(2+4) bits                                            < 0.01 mm2
              Total                                                                                          4.33 mm2 (< 1%)
Bimodal SBC   Data + Tag                  2MB 8-way 64B linesize + 1 additional displaced bit in tag-store   4.31 mm2
              Counters                    4096*4 bits                                                        < 0.01 mm2
              Insertion Policy            4096 bits                                                          < 0.01 mm2
              Association Table           4096*12 bits                                                       < 0.01 mm2
              DSS (entries + registers)   4*(1+12+4)+2*(2+4) bits                                            < 0.01 mm2
              Total                                                                                          4.34 mm2 (< 1%)
VII. RELATED WORK
There have been many proposals to improve the performance
of caches. We briefly review here those most closely
related to ours. The inability of some cache
sets to hold their working set has been addressed by victim
caches [1], which simply store the latest lines evicted from the
cache. This idea has been later refined with heuristics to decide
which lines to store in the victim cache. For example, [13]
bases its decisions on reload intervals, while [14] considers the
frequency with which each line appears in the miss stream.
The unbalance in the working set of different cache sets
has also been tackled with a large variety of approaches.
Thus, cache indexing functions [15][16] that distribute
accesses more uniformly than the standard indexing have been
proposed. Pseudo-associative caches are another alternative;
they increase the flexibility of the placement of lines in
the cache by considering two [17][18] or more alternative
locations [19] for each line. Contrary to the BSBC,
pseudo-associative caches do not use any mechanism such as the
SSL to decide when it is better to displace a line. Besides,
they perform searches line by line rather than set by set.
Something similar happens with the adaptive group-associative
cache (AGAC) [20], a proposal for first level direct-mapped
caches that stores lines that would have been otherwise evicted
in underutilized cache frames. AGAC takes decisions on a
per-line basis, based on recency of use, and it needs multiple
banks to aid swapping.
The Indirect Index Cache (IIC) [9] offers full associativity
because its tag-store entries keep pointers that allow them to be
associated with any data-entry. The IIC is accessed through a
hash table with chaining in a singly-linked collision table and
its management is much more complex than that of the BSBC.
For example, its generational replacement is run by software.
The NuRAPID cache [10] relocates data-entries inside the
data array in order to reduce the average access latency. This
requires the use of pointers between tag-entries and
data-entries. Nevertheless, its tag-array is managed in a standard
way both in terms of mapping and replacement policies. As
a result, it has the same miss rate and workload imbalance
problems among sets as a standard cache.
The V-Way cache [2] duplicates the number of sets and
tag-store entries while keeping the same associativity and
number of data lines. Data lines are assigned dynamically
to sets depending on the demand that each set experiences,
using a global replacement algorithm on the data lines based
on their reuse frequency. This allows different sets to end up
having different numbers of lines. Forward pointers between
the tag-store and the data-store entries allow any data line
to be assigned to any tag-entry. Reverse pointers allow the
replacement algorithm on the data lines to reassign them from
one tag-store entry to another. According to its authors'
studies, the V-Way cache offers miss rate reductions similar
to those of the IIC and outperforms the AGAC.
The Dynamic Set Balancing Cache (DSBC) [3] has already
been explained in detail in Section II-A. That paper also
presented a Static SBC (SSBC), which restricts each set to
be associated only with the farthest set in the cache, that
is, the one whose index is obtained by complementing the most
significant bit of the set index. We have used the DSBC because it
outperformed SSBC consistently in their experiments thanks
to its flexibility, while the extra implementation cost is small.
It also achieved better miss rates than the V-way cache and
DIP.
The proposals we have just discussed emphasize the flexibility
of the placement of lines in the cache to improve miss
rates or access times. Other researchers have focused on
modifying the insertion and replacement policies in order
to keep the most useful lines in the sets where they belong.
Examples of this kind of works are the adaptive insertion
policies in [4] that were detailed in Section II-B and the
pseudo-LIFO replacement policies [11]. Pseudo-LIFO policies
evict blocks from the upper part of the fill stack, i.e., among
the most recently inserted lines of the set. This contributes
to retaining a large fraction of the working set in the cache.
The probabilistic escape LIFO policy used in our evaluation,
which dynamically learns the most preferred eviction positions
within the fill stack and prioritizes the ones close to the top of
the stack, belongs to this family. It has been included in our
comparative evaluation because it outperformed many recent
proposals on a set of single-threaded, multiprogrammed, and
multithreaded workloads in [11].
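The fill-stack idea that the pseudo-LIFO family builds on can be sketched as follows. This is only an illustrative model, not the authors' implementation: the class name and structure are ours, and it shows only the basic policy of evicting from the top of the fill stack (probabilistic escape LIFO additionally learns preferred eviction positions within the stack):

```python
# Illustrative sketch of one set managed with a fill stack, the structure the
# pseudo-LIFO replacement policies [11] operate on. On a miss in a full set,
# the basic policy evicts from the top of the fill stack, i.e., among the most
# recently *filled* lines, so lines filled long ago stay resident.
class FillStackSet:
    def __init__(self, ways):
        self.ways = ways
        self.stack = []          # index 0 = bottom (oldest fill), -1 = top

    def access(self, tag):
        if tag in self.stack:    # hit: fill-stack position is unchanged,
            return True          # since the stack only tracks fill order
        if len(self.stack) == self.ways:
            self.stack.pop()     # miss in a full set: evict the stack top
        self.stack.append(tag)   # the newly filled line enters at the top
        return False

s = FillStackSet(ways=4)
for t in "abcd":
    s.access(t)                  # fills the set; bottom-to-top: a b c d
s.access("e")                    # miss: evicts d (top), keeps a, b, c
print(s.stack)                   # ['a', 'b', 'c', 'e']
```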
VIII. CONCLUSIONS
There has been extensive research to improve the behavior of caches, particularly of non-first-level ones. Different
approaches are sometimes better suited to reduce different
kinds of misses. For example, DSBC [3] and DIP [4] target
somewhat different problems. If implemented jointly in the
cache, complementary techniques like these ones should have
the potential to achieve better performance than any of them
isolated. Nevertheless, as this paper shows with the case of
DSBC and DIP, implementing them at the same time is not
enough to exploit their advantages. In fact, the direct
simultaneous application of these techniques often yields worse
results than using only one of them. We have analyzed
the reasons for this behavior and proposed in a reasoned way
an integrated design of these policies that allows them to
cooperate effectively. As part of this design we demonstrate the
usefulness of the Set Saturation Level (SSL) metric to detect
problems of capacity in the cache. Since this metric already
proved successful at detecting problems of unbalance among
the working set of different sets in [3], it was natural to turn
it into the arbiter of our coordinated approach to fight conflict
and capacity misses.
Simulations using benchmarks with varying characteristics
show that, when properly integrated with our Bimodal Set
Balancing Cache (BSBC) design, the joint application of the
DSBC and DIP policies goes from being often one of the
worst approaches to being the best one. For example in a
2MB, 8-way second level cache DIP+DSBC jointly reduces
the miss rate by 8.3% in relative terms, while the DSBC and
DIP reduce it by 12% and 10%, respectively. With the BSBC
the relative miss rate reduction almost doubles to 16%. This
also leads the BSBC to achieve the largest IPC improvement,
4.8% on average for this configuration, compared to the 3%
that a straightforward DSBC+DIP implementation provides. The
other policies tested (DSBC, DIP and probabilistic escape
LIFO) lie in between.
Future work includes evaluating and adapting the ideas
developed in this paper in the context of shared caches.
ACKNOWLEDGMENTS
This work was supported by the Xunta de Galicia under
project INCITE08PXIB105161PR and by the Ministry of Science
and Innovation, cofunded by the FEDER funds of the European
Union, under grant TIN2007-67537-C03-02. The authors are
members of the HiPEAC network. Finally, we want to thank
Mainak Chaudhuri for his assistance with the pseudo-LIFO cache.

REFERENCES
[1] N. P. Jouppi, "Improving direct-mapped cache performance by
the addition of a small fully-associative cache and prefetch buffers,"
in Proc. 17th Intl. Symp. on Computer Architecture, June 1990,
pp. 364–373.
[2] M. K. Qureshi, D. Thompson, and Y. N. Patt, "The V-Way Cache:
Demand-Based Associativity via Global Replacement," in Proc.
32nd Intl. Symp. on Computer Architecture, June 2005, pp. 544–555.
[3] D. Rolán, B. B. Fraguela, and R. Doallo, "Adaptive line placement
with the set balancing cache," in Proc. 42nd IEEE/ACM
Intl. Symp. on Microarchitecture, December 2009, pp. 529–540.
[4] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. S.
Emer, "Adaptive insertion policies for high performance caching,"
in Proc. 34th Intl. Symp. on Computer Architecture, June 2007,
pp. 381–391.
[5] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze,
S. Sarangi, P. Sack, K. Strauss, and P. Montesinos, "SESC
simulator," January 2005, http://sesc.sourceforge.net.
[6] Intel Corporation, "Intel Core i7 processor extreme edition and
Intel Core i7 processor datasheet," 2008.
[7] Digital Equipment Corporation, "Digital Semiconductor 21164
Alpha microprocessor product brief," March 1997.
[8] D. Weiss, J. Wuu, and V. Chin, "The on-chip 3-MB subarray-based
third-level cache on an Itanium microprocessor," IEEE
Journal of Solid-State Circuits, vol. 37, no. 11, pp. 1523–1529,
November 2002.
[9] E. G. Hallnor and S. K. Reinhardt, "A fully associative
software-managed cache design," in Proc. 27th Annual Intl. Symp. on
Computer Architecture, June 2000, pp. 107–116.
[10] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Distance
associativity for high-performance energy-efficient non-uniform
cache architectures," in Proc. 36th Annual IEEE/ACM Intl.
Symp. on Microarchitecture, December 2003, pp. 55–66.
[11] M. Chaudhuri, "Pseudo-LIFO: the foundation of a new family of
replacement policies for last-level caches," in Proc. 42nd Annual
IEEE/ACM Intl. Symp. on Microarchitecture, 2009, pp. 401–412.
[12] HP Labs, "CACTI 6.5," cacti65.tgz, retrieved in May 2010
from http://www.hpl.hp.com/research/cacti/.
[13] Z. Hu, S. Kaxiras, and M. Martonosi, "Timekeeping in the
memory system: predicting and optimizing memory behavior,"
SIGARCH Comput. Archit. News, vol. 30, no. 2, pp. 209–220,
2002.
[14] A. Basu, N. Kirman, M. Kirman, M. Chaudhuri, and J. F.
Martínez, "Scavenger: A new last level cache architecture with
global block priority," in Proc. 40th Annual IEEE/ACM Intl. Symp.
on Microarchitecture, December 2007, pp. 421–432.
[15] A. Seznec, "A case for two-way skewed-associative caches," in
Proc. 20th Annual Intl. Symp. on Computer Architecture, May
1993, pp. 169–178.
[16] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee, "Using prime
numbers for cache indexing to eliminate conflict misses," in Proc.
10th Intl. Symp. on High Performance Computer Architecture,
February 2004, pp. 288–299.
[17] A. Agarwal and S. D. Pudar, "Column-associative caches: A
technique for reducing the miss rate of direct-mapped caches,"
in Proc. 20th Annual Intl. Symp. on Computer Architecture, May
1993, pp. 179–190.
[18] B. Calder, D. Grunwald, and J. S. Emer, "Predictive sequential
associative cache," in Proc. 2nd Intl. Symp. on High-Performance
Computer Architecture, February 1996, pp. 244–253.
[19] C. Zhang, X. Zhang, and Y. Yan, "Two fast and high-associativity
cache schemes," IEEE Micro, vol. 17, pp. 40–49, 1997.
[20] J. Peir, Y. Lee, and W. W. Hsu, "Capturing dynamic memory
reference behavior with adaptive cache topology," in Proc. 8th
Intl. Conf. on Architectural Support for Programming Languages
and Operating Systems, October 1998, pp. 240–250.