Designing High Bandwidth On-Chip Caches
Kenneth M. Wilson and Kunle Olukotun
ISCA '97, Denver, CO, USA. © 1997 ACM 0-89791-901-7/97/0006...$3.50
Abstract

In this paper we evaluate the performance of high bandwidth caches that employ multiple ports, multiple cycle hit times, on-chip DRAM, and a line buffer to find the organization that provides the best processor performance. Processor performance is measured in execution time using a dynamic superscalar processor running realistic benchmarks that include operating system references. The results show that a large dual-ported multi-cycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and the use of a line buffer provide the bandwidth needed by a dynamic superscalar processor. In addition, the line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency.

1. Introduction

It is well known that the on-chip memory system of a microprocessor is a key determinant of microprocessor performance. Previously, the main concerns of memory designers were cache size, associativity, and line size. As the performance of microprocessors increases, instruction and data bandwidth demands lead chip designers to create more complicated on-chip memory systems. Currently, designers are considering multi-ported caches, multi-cycle caches, multi-level caches, and even caches made with on-chip DRAM. Though each of these has been covered individually in previous work [Joup94, MIPS94, Oluk92, Saul96, Shim96, Sohi91, Wils96], there are no papers that investigate the interaction of these different methods to find the organizations that work well together.

There are many options available for increasing cache bandwidth. One straightforward way of increasing cache bandwidth is to provide multiple cache ports that can be independently addressed by the processor's load/store unit in the same cycle. Ideally, each cache port operates independently, with no increase in hit time and only a small increase in chip area. Another way to increase memory bandwidth is to enlarge the primary data cache. Enlarging the primary data cache decreases the cache's miss rate, thereby decreasing the average response time of the processor's memory system. Increasing the size of the primary data cache causes the hit time of the cache to increase. Since accessing the data cache is usually on the processor's critical timing path, the hit time of larger caches may be pipelined to avoid increasing the processor's cycle time. The extra processor cycles can be partially hidden by the use of a dynamic superscalar processor. If a very large on-chip cache is desired, the designer may choose to use DRAM in place of SRAM. With improvements in DRAM technology it is now possible to include DRAM memory on-chip [Saul96][Shim96]. Caches containing DRAM can have much larger capacities than standard SRAM cache designs for equal processor die area.

One way to increase memory bandwidth and at the same time hide the increase in hit time needed by a larger cache is to add a small buffer with a one cycle access time to the processor's load/store execution unit. This level zero cache can be implemented as a small line buffer [Wils96]. A line buffer holds cache data inside the processor's load/store unit to allow a load to recently accessed data to be satisfied from the line buffer instead of from the cache. This prevents the load from occupying the primary data cache ports, allowing other loads to access the cache earlier. When a line buffer is added to a processor with a multi-cycle cache, not only does the line buffer reduce the utilization of the cache ports, but the line buffer also reduces the average latency of load instructions by returning data in a single cycle.

To understand the interaction between different high bandwidth cache organizations, we evaluate the performance of caches that employ combinations of multi-port, multi-cycle access, on-chip DRAM, and a line buffer. In the course of this evaluation we find the combination of features and cache parameters that provides the best processor performance. To determine processor performance we use an accurate model of a four issue dynamic superscalar processor with non-blocking caches. The rest of this paper is organized as follows: the high bandwidth on-chip cache designs for this investigation are described in section 2. The experimental methodology for simulating the architectures is discussed in section 3. The results that characterize the design space by looking at the interactions between different on-chip cache configurations are presented in section 4. First, the processor instructions per cycle (IPC) results show the effect of the different cache configurations on processor performance when changes in cycle time are not taken into account. Second, we combine the IPC results with the processor cycle times determined in section 2 to find the application execution times of our benchmarks for various cache sizes and structures. Section 5 gives our conclusions.
Another way to build a multi-ported cache is to duplicate the primary data cache, as in the DEC Alpha 21164 [Edmo95]. The Alpha contains two copies of the primary data cache. The dual-ported cache of the DEC Alpha has the performance drawback that all stores must be sent to both cache ports simultaneously so that both copies of the data cache are consistent. Using both cache ports for each store should not significantly degrade performance because stores can be buffered and bypassed to allow loads to access the cache first. We make the assumption that stores can be buffered until both cache ports are not servicing load instructions. This means that stores do not degrade machine performance; therefore ideal multiple cache ports are used to model duplicate caches.

Even though duplicating the data cache causes a doubling of the die area needed for the primary data cache, the only timing cost is the added delay due to adding one more write port to the load/store buffer. Since adding an additional write port causes only a small increase in delay, we assume that the added delay can be eliminated with increased effort spent on the circuit design. This means that the access times for single-ported caches shown in Figure 1 can also be used for duplicate caches.

2.2. Pipelined Caches
With application working sets increasing and the gap between memory access time and processor cycle time expanding, some chip designers will decide to implement larger caches that require multi-cycle hit times. Using multi-cycle hit times allows the use of larger and more complicated primary caches without impacting processor cycle time. This is especially important for primary data caches. The extra hit time of a multi-cycle cache can be partially hidden by pipelining the cache [Oluk92] and using a dynamic superscalar processor [Wils95].

The size of a cache that can be accessed in two or three processor cycles can be determined from the cycle times shown in Figure 1. If a processor has a cycle time of 25 FO4, an 8 Kbyte cache can be accessed in one processor cycle, a 512 Kbyte cache can be accessed in 1.67 cycles, and a 1 Mbyte cache can be accessed in 2.20 cycles. Pipelining requires the addition of a latch with a delay of 1.5 FO4 to create a 512 Kbyte cache that can be implemented on-chip and be accessed in two processor cycles. Further, Figure 1 shows that if a 1 Mbyte cache is desired, either the processor cycle time has to be increased to make this cache fit into a two cycle hit time, or the hit time of the 1 Mbyte cache must be increased to three processor cycles. In this study we do not consider on-chip SRAM caches larger than 1 Mbyte.
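As a quick check of these numbers, the pipelined hit time in cycles is the latched access time divided by the cycle time, rounded up (a sketch assuming one 1.5 FO4 pipeline latch; a three stage pipeline would add a second latch, which does not change the result here):

\[
n_{\text{hit}} = \left\lceil \frac{t_{\text{access}} + t_{\text{latch}}}{t_{\text{cycle}}} \right\rceil,
\qquad
\left\lceil \frac{1.67 \times 25 + 1.5}{25} \right\rceil = \lceil 1.73 \rceil = 2 \text{ cycles (512 Kbyte)},
\qquad
\left\lceil \frac{2.20 \times 25 + 1.5}{25} \right\rceil = \lceil 2.26 \rceil = 3 \text{ cycles (1 Mbyte)}.
\]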
2.3. Increasing Cache Port Efficiency
A line buffer is a small fully-set-associative multi-ported level-zero cache located within the processor's load/store execution unit [Wils96]. When a line buffer is added to a processor's load/store execution unit, it reduces the number of accesses to the cache ports by satisfying a large percentage of the load instructions before they access a cache port. By decreasing the number of accesses to the cache ports, the line buffer reduces the need for multiple cache ports. Another advantage of a line buffer is that it returns recently accessed cache data in a single cycle, which helps hide the additional latency of pipelined caches.
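To make the mechanism concrete, the sketch below models a load checking the line buffer before consuming a primary cache port. This is our own illustration, not the paper's implementation: the entry count, LRU eviction, and line-granularity entries are assumptions (the buffer evaluated in section 4.2 holds 32 64-bit entries).

# Simplified model of a level-zero line buffer in the load/store unit.
# A load that hits in the line buffer returns in one cycle and never
# occupies a primary data cache port; a miss uses a port and pays the
# (possibly multi-cycle, pipelined) cache hit time.

LINE_SIZE = 32       # bytes, matching the primary data cache
NUM_ENTRIES = 4      # assumed for illustration
CACHE_HIT_TIME = 2   # cycles, e.g. a pipelined 512 Kbyte cache

class LineBuffer:
    def __init__(self):
        self.lines = []  # fully associative; most recently used last

    def load(self, addr):
        """Return (latency_cycles, used_cache_port) for one load."""
        tag = addr // LINE_SIZE
        if tag in self.lines:          # line buffer hit
            self.lines.remove(tag)
            self.lines.append(tag)     # maintain LRU order
            return 1, False            # single cycle, no cache port used
        if len(self.lines) == NUM_ENTRIES:
            self.lines.pop(0)          # evict least recently used entry
        self.lines.append(tag)         # install line fetched via a port
        return CACHE_HIT_TIME, True

lb = LineBuffer()
print(lb.load(0x1000))  # (2, True)  first touch goes to the cache
print(lb.load(0x1008))  # (1, False) same line: served by the buffer

The second load frees its cache port for another load in the same cycle, which is exactly the port-pressure reduction described above.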
2.4. DRAM Caches
A fairly recent proposal for building an on-chip cache is to use DRAM on the processor die [Saul96][Shim96]. The main advantage of DRAM is that it allows much larger caches to be constructed on-chip. The chief disadvantage is longer cache hit times. Much of the performance effect of the longer hit times can be hidden through the addition of a small first level cache and the latency hiding abilities of a dynamic superscalar processor. The extra latency can be further reduced by using a line buffer to return data with high spatial and temporal locality in a single processor cycle.

The biggest hurdle to placing DRAM caches on the processor die is that the technologies used to create DRAMs and processors are completely different. With today's technologies, it is only possible to build a low performance processor on a DRAM die; this might be useful for microcontrollers and low-end CPUs. High-end CPUs are more likely to use the option of building DRAM on the processor die. Due to the differing technologies, this will make the DRAM consume about twice the die area and decrease the DRAM performance.

One advantage of using DRAM for on-chip caches is that the row buffers of the DRAM can be combined to make a first level data cache. In [Saul96], two row buffers for every DRAM bank were used to make a 16 Kbyte primary data cache with two-way set associativity. The disadvantage of using the row buffers as a primary data cache is that each row buffer contains 512 bytes. This means that the line size of the 16 Kbyte data cache is 512 bytes. The longer line size reduces the hit rate of the row buffer data cache, but should not have too great an impact since the miss time of the row buffers is the time to access the DRAM cache. For this study, we use a DRAM cache size of 4 Mbytes and do not use a secondary off-chip cache due to the large size of the DRAM cache.

The on-chip DRAM hit time is not completely understood at this time, but [Saul96] predicts a six cycle hit time is possible for a 200 MHz processor. They also predict that using the DRAM's row buffers as a first level cache allows a cache hit time of one processor cycle if the data is in the row buffer. These hit times are very aggressive. An actual implementation of a processor in a DRAM technology [Shim96] used a separate 2 Kbyte SRAM data cache with a one cycle hit time and a miss time to DRAM of five cycles, but with a processor frequency of only 66 MHz. It takes a DRAM cache design similar to [Saul96] to provide enough bandwidth to compete with standard on-chip SRAM caches. Due to our concern for adequate bandwidth we use a one cycle hit time 16 Kbyte two-way-set-associative row buffer cache, but vary the hit time of the DRAM cache from six to eight cycles to get an idea of how processor performance decreases as the hit time of the DRAM increases.
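The row buffer cache geometry follows directly from these parameters (the pairing of each bank's two row buffers into one two-way set is our reading of [Saul96], not a detail stated above):

\[
\frac{16\ \text{Kbytes}}{512\ \text{bytes per row buffer}} = 32\ \text{row buffers}
\;\Rightarrow\;
\frac{32\ \text{row buffers}}{2\ \text{per bank}} = 16\ \text{DRAM banks},
\]

with each bank's pair of 512 byte row buffers presumably forming one set of the two-way-set-associative cache.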
3. Simulation Methodology
This section describes the dynamic superscalar processor used to evaluate the different on-chip memory structures, and the SimOS [Rose95b] environment under which the processor is simulated. We also discuss the simulation methods and benchmarks used to produce the results presented in this study.

[Figure 2: The simulated processor die. Processor: 4 issue dynamic superscalar, R10000 instruction latencies, 64 entry instruction window, 32 entry load/store buffer, 200 MHz clock rate. Primary data cache: 4 Kbyte to 1 Mbyte, 1-3 cycle hit time, two-way-set-associative, 32 byte line size, 4 MSHRs; perfect one cycle instruction cache; 2.5 GByte/s peak bandwidth to the secondary cache. Secondary cache: 4 Mbyte, 10 cycle hit time, two-way-set-associative; 1.6 GByte/s peak bandwidth to main memory.]

Aside from the limits imposed by the memory subsystem, there are no restrictions on the type of instructions that can be issued each cycle.

The memory subsystem consists of separate primary instruction and data caches, a unified secondary cache, and main memory. The primary instruction cache is a perfect cache that always hits and responds in a single cycle. The primary data cache is a two-way-set-associative cache with 32 byte lines and a single cycle access time, or a fully pipelined multi-cycle access time. Note that the load latency is actually one cycle greater than the cache access time due to the load's address calculation. The data cache is lock-up-free, but the cache size is not fixed, allowing us to investigate the performance effects of various cache sizes. The second level cache is a 4 Mbyte two-way-set-associative cache with 64 byte lines and a ten cycle (50 ns) access time. Four miss status handling registers (MSHRs) [Fark94] implement the lock-up-free cache and are located in the primary data cache to support misses for up to four data cache lines. Main memory has a sixty cycle (300 ns) access time.
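The role of the four MSHRs can be illustrated with a schematic model (our own sketch; the merging and blocking policy shown is the textbook behavior of [Fark94]-style MSHRs, not a description of the simulator's internals):

# Schematic model of miss status handling registers (MSHRs).
# Each outstanding miss to a distinct cache line occupies one MSHR;
# a second miss to the same line merges with the existing entry, and
# the cache only blocks when all MSHRs are busy.

NUM_MSHRS = 4
LINE_SIZE = 32  # bytes

class MSHRFile:
    def __init__(self):
        self.pending = {}  # line address -> ids of waiting loads

    def miss(self, addr, load_id):
        """Return True if the miss is accepted, False if the cache blocks."""
        line = addr // LINE_SIZE
        if line in self.pending:             # secondary miss: merge
            self.pending[line].append(load_id)
            return True
        if len(self.pending) == NUM_MSHRS:   # all MSHRs busy: stall
            return False
        self.pending[line] = [load_id]       # primary miss: allocate
        return True

    def fill(self, addr):
        """Line returned from the next level; wake all merged loads."""
        return self.pending.pop(addr // LINE_SIZE, [])

mshrs = MSHRFile()
assert mshrs.miss(0x2000, load_id=1)  # primary miss, allocates an MSHR
assert mshrs.miss(0x2008, load_id=2)  # same line, merged
print(mshrs.fill(0x2000))             # [1, 2]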
For some applications high performance off-chip memory bandwidth is critical for good performance. Our memory organization supports a bandwidth of 1.6 GBytes/second peak between the second level cache and main memory, and 2.5 GBytes/second peak between the processor chip and the second level cache. Though these memory bandwidths may seem very impressive, the DEC Alpha 21264 [Kell96] supports even greater bandwidths of 2 GBytes/second from the second level cache to memory and 4 GBytes/second from the processor die to the second level cache.

3.2. SimOS Environment
SimOS [Rose95b] is a complete machine simulation environment containing models of all the hardware present on modern computer systems including the CPU, memory subsystems, and I/O devices. It can simulate the hardware with enough speed and detail to boot and run a commercial Unix-based operating system, Silicon Graphics IRIX version 5.3, and arbitrary application programs that run on IRIX.

This study uses the MXS [Benn95] CPU simulator running under SimOS. MXS is a detailed CPU simulator that simulates the machine model described earlier in this section. This CPU simulator models a dynamic superscalar processor that supports multiple instruction issue, out-of-order execution, hardware branch prediction, and lock-up-free caches. The simulator's operation is parameterized and can be configured to accurately model a variety of machines.

3.3. Benchmarks
The performance of nine realistic benchmarks is used to evaluate the high bandwidth cache designs that are proposed in section two. Table 1 shows that the nine benchmarks are made up of three integer SPEC95 benchmarks, three floating point SPEC95 benchmarks, and three multiprogramming applications. The newer SPEC95 benchmarks were chosen over SPEC92 because their larger working set sizes stress the memory system much more than the SPEC92 benchmarks. The three SimOS benchmarks were chosen because they represent realistic multiprogramming workloads from the areas of programming, database transaction processing, and engineering simulation.

SimOS multiprogramming benchmarks:
  pmake: two compilation processes for 17 files.
  database: Sybase SQL server running bank/customer transaction processing modeled after the TPC-B transaction processing benchmark [Gray93].
  vcs: simulates the FLASH MAGIC chip [Kusk94] using the Chronologic VCS simulator.
Table 1: The nine benchmarks.

Each benchmark is simulated for over 100 million instructions within an interesting portion of its execution. The breakdown of execution time spent in kernel, user, and idle for each benchmark is shown in Table 2. Idle is a result of waiting for I/O, and applications such as database spend a large portion of their execution in this mode. Since the processor is spinning in a tight loop in idle
mode, the instructions per cycle (IPC) for this mode could skew the results for benchmarks with significant idle time. For this reason, the IPC of the benchmarks during idle time is not factored into our performance measurements.

            Kernel   User   Idle   Load   Store
gcc           10.0   90.0    0.0   28.1    12.2
li             0.2   99.8    0.0   33.2    13.0
compress       8.4   91.6    0.0   34.5     8.0
tomcatv        0.4   99.6    0.0   28.9     8.5
su2cor         0.5   99.5    0.0   28.0     6.3
apsi           2.2   97.8    0.0   40.0    11.7
pmake          8.9   86.0    5.1   25.8    11.9
database      18.4   17.0   64.6   24.8    13.6
vcs            9.9   90.1    0.0   25.7    15.1
Table 2: Execution time percentages plus percentages of loads and stores in the instruction stream.

It is interesting to note that the floating point benchmarks almost never execute in kernel mode, while the integer benchmarks spend up to ten percent of their execution time in kernel mode. In addition to improving the accuracy of simulation results for integer benchmarks, the simulation of kernel mode, as well as user mode, allows the simulation of multiprogramming benchmarks such as pmake and kernel intensive benchmarks like database, which spend 9% and 52% of their non-idle time in kernel mode respectively.

4. Results
To characterize the performance of the different on-chip cache designs we simulate them with cache sizes varying from 4 Kbytes to 1 Mbyte. The simulation results are shown for one representative benchmark from each of the three groups. Gcc represents the integer SPEC95 benchmarks, tomcatv the floating point SPEC95 benchmarks, and database the multiprogramming benchmarks. The results are presented in two stages. First, to understand how IPC changes with cache structure, we present results for a processor with a fixed processor cycle time. We consider the effect that pipelining has on IPC for an ideal 32 Kbyte multi-ported cache as the pipeline depth of the cache is increased from one to three cycles. We also perform the same study with banked caches and compare the results to ideal multi-ported caches. Then we show how the addition of a line buffer increases IPC by hiding memory latency. To conclude our study of IPC we look at DRAM caches. Second, using our knowledge of IPC we investigate which memory organization provides the smallest application execution time for different processor cycle times.

Before we begin looking at the high bandwidth memory organizations, we look at the miss rates per instruction of our nine benchmarks. The miss rates give an idea of the application working set size and the changes in benchmark performance with cache size. Figure 3 shows the effect on the cache misses per instruction of increasing the primary data cache size for each of the nine benchmarks used in this study. Note that the miss rates for each of our representative benchmarks are consistent with those of the other two benchmarks within their grouping.

From Figure 3 we see that the integer SPEC95 benchmarks have the lowest miss rates, the multiprogramming benchmarks have much larger miss rates, and the floating point SPEC95 benchmarks have slightly lower miss rates on average than the multiprogramming benchmarks. The integer benchmarks have the smallest miss rates because they have working sets that fit into smaller caches. We also see that for integer benchmarks changes in miss rate due to increases in cache size tend to be incremental. The multiprogramming benchmarks are integer applications with larger working sets than the integer SPEC95 benchmarks, so the change in miss rate with increasing cache size responds similarly for both; however the multiprogramming applications' miss rates are greater. The floating point applications have large working sets and radical drops in miss rates at specific cache sizes. Their response to increasing cache size is different from the integer applications because floating point applications tend to access their data in more regular patterns than integer applications.

[Figure 3: Cache misses per instruction as a function of primary data cache size for the nine benchmarks.]
4.1. Multi-Cycle, Multi-Ported Caches

[Figure 4: The performance of ideal multi-cycle multi-ported 32 Kbyte caches (one to four ports) with a fixed processor cycle time. A "1-" is used to denote a one cycle hit time.]

[Figure 5: The performance of 32 Kbyte multi-cycle banked caches (1 to 128 banks) with a fixed processor cycle time.]

We begin by presenting the IPC results of ideal multi-cycle caches with a fixed processor cycle time. For caches with multi-cycle hit times, a key question is: how much performance is lost as the hit time, measured in processor cycles, increases? Figure 4 shows how machine performance decreases with increasing hit time for one to four ideal cache ports with a 32 Kbyte primary data cache for gcc, tomcatv, and database. We see from Figure 4 that the decrease in performance for the benchmarks with multi-cycle hit times is almost independent of the number of ideal cache ports. In the case of integer applications, multi-cycle hit times degrade machine performance considerably. For the integer application gcc with two ideal cache ports, performance declines by 18% when the cache hit time is increased to two cycles, and performance declines another 15% when the cache hit time is increased to three cycles. Without the use of a dynamic superscalar processor there would be a larger performance decrease given the frequency of loads. The dynamic superscalar processor is able to hide a portion of the additional latency of the pipelined cache by reordering instruction execution to start the cache accesses as early as possible and rescheduling the load uses when necessary. For floating point applications, machine performance does not decrease nearly as much due to the large amount of instruction level parallelism (ILP) available. The dynamic superscalar processor uses the ILP available in tomcatv to hide a greater portion of the multi-cycle hit time than is possible for integer applications. Tomcatv shows only a 3% drop in performance for a two cycle hit time cache and another 3% drop in performance for a three cycle hit time. For database, a larger integer application, the performance decrease is 13% and 12% for two and three cycle hit times respectively. The smaller decline compared to gcc is due to database's larger working set, which causes a higher miss rate and therefore reduces the performance impact due to increases in cache hit time.

Figure 5 shows the performance that results with banked caches of 1 to 128 banks and one to three cycle hit times for a 32 Kbyte primary data cache and fixed processor cycle time. By comparing Figure 5 with Figure 4, we see that a four-way banked cache provides lower performance than a cache with two ideal cache ports, while an eight-way banked cache provides higher performance than a cache with two ideal cache ports. The 128-way banked cache results show that the performance difference between an eight-way banked cache and a cache with a large number of banks is small. Note that where banked caches are discussed we are interested in the number of external banks, where each external bank has its own cache port. These results hold for all nine benchmarks and validate our focus on eight-way banked caches in this study.

By comparing Figure 5 to Figure 4, we see that the decrease in performance for multi-cycle hit times is about the same for an eight-way banked cache as it is for caches with two ideal cache ports. In addition, the slope of the performance decrease shown in Figures 4 and 5 is fairly constant for each benchmark as the hit time is increased from one to three cycles. The results in Figures 4 and 5 are for a fixed cache size of 32 Kbytes. In section 4.4 we investigate the performance effect of changes in cache size, processor cycle time, and cache hit time.
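Since bank conflicts are what separate a banked cache from one with ideal ports, a minimal sketch of the conflict check may help (our illustration; selecting the bank with the low-order line-address bits is a common convention we assume, not a detail given in the paper):

# Two same-cycle accesses conflict in a banked cache when they map to
# the same external bank; ideal multi-ported caches never conflict.

LINE_SIZE = 32
NUM_BANKS = 8  # the eight-way banked configuration studied here

def bank_of(addr):
    # Bank selected by the low-order bits of the line address (assumed).
    return (addr // LINE_SIZE) % NUM_BANKS

def stalled_accesses(addrs):
    """Count accesses in one cycle that lose bank arbitration to an
    earlier access and must wait for a later cycle."""
    busy, stalled = set(), 0
    for a in addrs:
        b = bank_of(a)
        if b in busy:
            stalled += 1
        else:
            busy.add(b)
    return stalled

# Loads one line apart use different banks and proceed in parallel,
# while loads 8 Kbytes apart collide on bank 0 and one must stall.
print(stalled_accesses([0x0000, 0x0020]))  # 0
print(stalled_accesses([0x0000, 0x2000]))  # 1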
4.2. Line Buffer with Multi-Ported Multi-Cycle Caches
The use of a 32-entry, 64-bit-per-entry fully-set-associative multi-ported line buffer allows a large percentage of load instructions to be satisfied from the buffer in a single cycle instead of from the data cache. Figure 6 shows the IPC of the three representative benchmarks for 32 Kbyte eight-way banked and duplicate caches of one, two, and three cycle hit times with and without a line buffer. In the legend of Figure 6, the label "LB" is added to show when a line buffer is included in the processor's load/store execution unit. In all cases the addition of a line buffer increases machine performance, but for the eight-way banked cache with a single cycle hit time the increase in machine performance is only about one-half of a percent. The reason for this small increase is that the line buffer was originally proposed to decrease the cache port access pressure when there is only one cache port. In a four issue processor with more than two cache ports, the load/store unit does not tend to get backed up with loads waiting to access the cache port; therefore, there is little performance gain due to the addition of a line buffer. The only exception is for very small caches, where the extra associativity provided by the line buffer increases processor performance.

On the other hand, a single cycle duplicate cache only contains two cache ports; therefore decreasing the cache port access pressure with a line buffer provides a greater processor performance increase of three percent. A previous study [Wils96] showed that using four ideal cache ports instead of two ideal cache ports increased performance by less than five percent for a one cycle 32 Kbyte cache. A line buffer achieves over 60% of that performance improvement without increasing the number of cache ports from two to four.
[Figure 6: IPC of 32 Kbyte eight-way banked and duplicate caches of one, two, and three cycle hit times, with and without a line buffer (LB), for gcc, tomcatv, and database.]

4.3. Performance of DRAM Caches
Figure 7 presents the performance results for our dynamic superscalar processor with a 4 Mbyte DRAM on-chip data cache, a 16 Kbyte two-way-set-associative single-cycle cache created from the DRAM row buffers, and no 4 Mbyte SRAM off-chip cache. The hit time of the DRAM is varied from six to eight cycles to show how longer DRAM latencies will affect machine performance. Only the results for eight-way banked DRAM caches are presented because a 4 Mbyte cache is large and the DRAM's row buffers act as banks.

[Figure 7: Multi-cycle 4 Mbyte DRAM cache with an eight-way banked 16 Kbyte row buffer cache and fixed processor cycle time, for DRAM hit times of six to eight cycles.]

Figure 7 shows that a line buffer provides a negligible improvement in the performance of gcc and database but drastically increases the performance of tomcatv. For an eight-way banked one cycle hit time cache, a line buffer is not really needed to reduce the access pressure on the cache ports. Therefore, we do not expect an improvement from the line buffer unless very small caches are used or there are a great number of conflict misses that can take advantage of the extra associativity provided by the line buffer. Database shows no performance increase due to the line buffer, but gcc gains more incremental performance from adding a line buffer to DRAM caches than to SRAM caches. This occurs because the use of 512 byte cache lines reduces processor performance by increasing the number of conflict misses and therefore increasing the importance of the extra associativity added by the line buffer. The line buffer increases the performance of tomcatv 18% by reducing the conflict misses that are caused both by the increase in line size and the small size of the 16 Kbyte cache. With a line buffer, the performance cost of using the 16 Kbyte two-way-set-associative 512 byte line row buffer cache instead of an equivalent SRAM cache with 32 byte lines is 17%, 6%, and 6% for tomcatv, gcc, and database respectively.

The performance lost due to increases in the hit time of the DRAM cache is not that significant because of the single-cycle hit time row buffer cache. The same amount of performance is lost with an increase in DRAM hit time of six to seven cycles as is lost for an increase in hit time from seven to eight cycles. On average for all nine benchmarks, the processor performance drops by 3% for each one cycle increase in the DRAM hit time.
4.4. Comparing Design Alternatives
To determine which on-chip memory organization provides the best IPC, we simulate the nine benchmarks with cache sizes ranging from 4 Kbytes to 1 Mbyte, and cache hit times of one to three processor cycles. As expected, the best performing organizations are single cycle access duplicate caches and eight-way banked caches with a line buffer inside the processor's load/store execution unit.

[Figure 8: IPC of multi-cycle banked and duplicate caches with a line buffer and fixed processor cycle time, for gcc, tomcatv, database, and the average of the nine benchmarks.]

Figure 8 shows how the IPC of the processor increases with increasing cache size for gcc, tomcatv, database, and the average of the nine benchmarks. This is done for duplicate and eight-way banked pipelined caches of one to three cycles with a line buffer. The datapoint for the 4 Mbyte DRAM cache with a one cycle 16 Kbyte row buffer cache and a six cycle DRAM hit time is also shown. The curves for both gcc and database have the same characteristics as the curves for the average of the nine benchmarks. Tomcatv behaves slightly differently. For larger cache sizes, the eight-way banked cache just barely outperforms the duplicate cache. Large eight-way banked caches perform better because floating point applications, like tomcatv, are more likely than integer applications to take advantage of the extra cache ports by having more than two non-conflicting load instructions ready to access the cache in a single cycle. On the other hand, for a small cache tomcatv performs significantly better with a duplicate cache than an eight-way banked cache. The higher miss rate of the small cache causes cache lines to be replaced more frequently. For tomcatv, the banked cache tends to group cache accesses to different cache lines within a single cycle. This increases the rate of cache line replacement, lowering the cache's ability to take advantage of spatial locality and therefore increasing the miss rate. As the miss rate decreases with larger caches, the chance of replacing a cache line decreases, and the performance advantage of more cache ports allows the eight-way banked cache to outperform the duplicate cache.

In Figure 8 we see that on average the performance of a duplicate cache is greater than that of the equivalent eight-way banked cache even though Figures 4 and 5 show us the opposite result. The change is due to the inclusion of a line buffer within the load/store execution unit of the processor. Earlier in this study we showed that without the line buffer, eight-way banked caches always outperform duplicate caches. Although an eight-way banked cache has the drawback of bank conflicts, the performance penalty is not as severe as being restricted to two cache ports. Adding a line buffer does not significantly reduce bank conflicts, but does reduce port contention; therefore, a duplicate cache with a line buffer can outperform an eight-way banked cache with a line buffer. Furthermore, the cache access time of an eight-way banked cache is always greater than or equal to the cache access time of a duplicate cache of the same size. Consequently, on average a duplicate cache with a line buffer will outperform an eight-way banked cache with a line buffer. For this reason we only examine duplicate caches to find the best cache organization for a given processor cycle time.

The IPC of a six cycle hit time DRAM cache with a one cycle row buffer cache and a line buffer is also included in Figure 8. Even with the optimistic hit time simulated for the DRAM cache, on average the DRAM cache does not perform as well as a 16 Kbyte SRAM cache. The reasons for this are, first, the 512 byte lines of the row buffer cache cause a significant degradation in processor performance due to conflict misses that is only partially alleviated by the use of a line buffer, and second, the 16 Kbyte row buffer cache is not large enough to make up for the six cycle latency of the DRAM. A DRAM cache can only compete with an SRAM cache if the row buffer cache is increased to 32 Kbytes and the performance degradation due to the use of 512 byte lines can be hidden.
One thing to note from Figure 9 is that even with a fixed cache size, a given percentage reduction in processor cycle time results in a smaller percentage reduction in execution time. This is due to the fact that a significant portion of an application's execution time is spent in the memory system. For example, the performance of tomcatv only increases by a factor of 1.5, even though the processor cycle time is reduced by a factor of three. We find that tomcatv spends roughly half of its execution time in the memory system. Using a speedup of three in processor cycle time and fifty percent of the execution time spent in the processor, Amdahl's Law [Henn96] predicts an overall performance increase of 1.5 times.
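The arithmetic behind this prediction, with fraction f = 0.5 of execution time affected by a speedup s = 3 [Henn96]:

\[
\text{Speedup}_{\text{overall}} = \frac{1}{(1-f) + f/s} = \frac{1}{0.5 + 0.5/3} = \frac{1}{2/3} = 1.5.
\]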
The non-pipelined cache results of Figure 9 can be compared to the results of Jouppi and Wilton [Joup94] for single-level on-chip caches. They used cacti for the access times of their caches and a trace-driven memory simulator to simulate the performance of 1 Kbyte to 256 Kbyte direct mapped blocking caches. We also use cacti for our cache access times, but we simulate a dynamic superscalar processor with two-way-set-associative non-blocking caches. The dynamic superscalar processor and the non-blocking caches hide part of the performance lost due to small cache sizes, and a two-way-set-associative cache will perform about as well as a direct-mapped cache of twice its size [Henn96]. These differences tend to increase the processor's performance for smaller size caches and decrease the incremental performance gained from doubling the cache size.

Our two studies overlap for gcc and tomcatv executed with single-cycle single-level caches varying in size from 4 Kbytes to 256 Kbytes. Jouppi and Wilton show that the best performance with a single level of cache is achieved for SPEC92 gcc with a 32 Kbyte direct mapped primary data cache. For SPEC95 gcc, we find that for an aggressive dynamic superscalar processor, the best performance for gcc is obtained with a two-way set-associative 32 Kbyte duplicate cache. In the case of SPEC92 tomcatv, Jouppi and Wilton found that only a 16 Kbyte data cache was required, but we find that a 32 Kbyte duplicate cache is needed to provide the best execution times. Larger set-associative caches are required due to the larger working set sizes of the SPEC95 benchmarks. Our caches are also multi-ported to support the additional cache bandwidth required by a dynamic superscalar processor.

The amount of cache pipelining that optimizes performance depends on both the working set size of the applications that are executed on the dynamic superscalar processor and the processor's cycle time. Executing with larger application working sets increases the cache miss rate, requiring a larger cache to minimize execution time. Refer to Figure 3 to see how the larger cache sizes made available by pipelining reduce the miss rate. Decreasing the processor's cycle time makes it harder to build a cache large enough to minimize execution time without pipelining the cache. Therefore, the combination of a large working set size and a small processor cycle time requires large pipelined caches to produce the best processor performance.

In Figure 9 we see that the best performing cache organization is indeed based upon both processor cycle time and working set size. If the processor's cycle time is large enough, a single-cycle 64 Kbyte duplicate cache will provide the best performance, but if the processor's cycle time is less than 25 FO4, a pipelined cache is always the best performer. In between these two extremes the best organization is dependent upon both cycle time and working set size. For gcc we see that if the processor can accommodate a single-cycle 16 Kbyte cache then the cache should not be pipelined; however, database has a larger working set than gcc, so its results lead to a different conclusion. In fact, if we simulate only small applications with working sets that can fit in a 4 Kbyte cache, there is no point in using pipelined caches unless the processor's frequency is so fast that a one cycle 4 Kbyte cache cannot be built. Only applications with large realistic working sets need the larger caches that are possible with pipelining.

From Figure 9 we see that a processor cycle time of 29 FO4 can accommodate a one cycle 64 Kbyte duplicate cache. Even with database's working set size, a 64 Kbyte single cycle cache outperforms any pipelined cache that can be built. For processor cycle times of less than 29 FO4, the larger pipelined cache maximizes processor performance. In the unlikely event that a dynamic superscalar processor can be built in 15 FO4, a 1 Mbyte three cycle cache provides the best performance, but a two cycle 32 Kbyte cache should be considered for its smaller die area at the cost of a small drop in performance. If a processor can be built with a cycle time of 10 FO4, then at least three cycles of pipelining are required.

Figure 9 shows the best cache size and pipelining necessary to minimize application execution time for a given processor cycle time. When the processor cycle time is not fixed, we find that on average a 32 Kbyte cache minimizes execution time for each cache pipeline depth. This shows that on average our application working sets fit well enough within a 32 Kbyte cache that increasing the cache size does not sufficiently decrease the cache miss rate to compensate for the resulting increase in cycle time.

5. Conclusion
In this study we have investigated a number of on-chip data cache organizations that include multi-ported caches, multi-cycle caches, line buffers, and DRAM caches. We have evaluated these cache organizations with simulations that include operating system references for nine realistic benchmarks with primary data cache sizes ranging from 4 Kbytes to 4 Mbytes. A dynamic superscalar processor with a lockup-free fully-pipelined cache was used to maximize the bandwidth demands on the cache. The processor cycle times were varied from a cycle time of 29 FO4, which can accommodate a 64 Kbyte one cycle single-ported primary data cache, to extremely fast processors that have a processor cycle time of only 10 FO4.

For a fixed cache size and processor cycle time, increasing the number of cache ports through cache banking and cache duplication increases processor performance, while increasing the amount of pipelining in the cache decreases processor performance. In the case of ideal cache ports, increasing the number of ports from one to two results in a 25% increase in processor performance. An increase in the number of cache ports from two to three achieves a 4% increase in performance, and an increase from three to four ports increases performance by less than 1%. Duplicating the cache results in processor performance comparable to a cache with two ideal cache ports. Adding extra cache ports by banking the cache also increases processor performance, but much more slowly than adding
ideal cache ports. We used cacti to analyze an eight-way banked cache and demonstrated that banking the cache increases the hit time for small cache sizes. Larger caches, however, already use an internal banking structure. Looking at IPC and ignoring changes in cycle time, we found that a processor with a duplicate cache will always outperform a four-way banked cache but that an eight-way banked cache will always outperform a duplicate cache. For this reason we concentrate on eight-way banked caches and duplicate caches in this study. When pipelining is added to the cache with cache size and cycle time held constant, we see a 12-23% decrease in IPC per pipeline stage for integer applications but only a 3-9% decrease for floating point applications.

Since adding additional cache ports can be expensive both in die area and cache cycle time, a line buffer can be used to reduce the performance loss due to building fewer cache ports. The use of a line buffer with a duplicate cache improves processor performance by 3%, while a line buffer with an eight-way banked cache improves performance by only 0.5%. In fact, the performance improvement due to a line buffer used with a duplicate cache changed the results so that the IPC of a duplicate cache with a line buffer is on average always greater than or equal to the IPC of an eight-way banked cache with a line buffer. In addition, the use of a line buffer with pipelined multi-ported caches decreases the performance drop due to pipelining by 28%-61% for integer applications and 50%-74% for floating point applications.

Combining cycle time, cache size, and cache structure to find the cache organization with the best performance on our nine benchmarks, we found that the use of a large pipelined duplicate cache with a line buffer almost always results in the best processor performance. Since the use of a line buffer always increases processor IPC without a corresponding increase in processor cycle time, and the cycle time of a duplicate cache is always less than or equal to the cycle time of an eight-way banked cache, a processor performs better with a duplicate cache instead of an eight-way banked cache as long as a line buffer is included within the load/store execution unit of the processor. We also found that the use of an aggressive six cycle hit time 4 Mbyte DRAM cache with a one cycle eight-way banked 16 Kbyte two-way-set-associative primary row buffer cache and a line buffer, on average, provides less processor performance than an equivalent 16 Kbyte SRAM cache with a ten cycle hit time 4 Mbyte secondary cache.

To minimize application execution time, the pipeline depth of the cache must be determined by the processor's cycle time. For a processor with a slow cycle time of 29 FO4, a 64 Kbyte dual-ported single-cycle cache provides the best processor performance. On the other hand, for processor cycle times between 29 FO4 and 24 FO4, a two cycle cache delivers the best performance if there is enough room on the processor die for the larger two cycle cache. For processor cycle times of less than 24 FO4, the cache must be pipelined since the processor cannot support a single-cycle non-pipelined cache of even 4 Kbytes.

Acknowledgments
We would like to thank Mendel Rosenblum, Steve Herrod, Edouard Bugnion, and Robert Bosch for their help with SimOS, Mark Horowitz for his help with cache design, and the reviewers for their insightful comments. This work was supported by ARPA contract DABT63-94-C-0054.

References
[Aspr93] Tom Asprey, Gregory S. Averill, Eric DeLano, Russ Mason, Bill Weiner, and Jeff Yetter, "Performance Features of the PA7100 Microprocessor", IEEE Micro, June 1993, pp. 22-35.
[Benn95] James Bennett and Mike Flynn, "Performance Factors for Superscalar Processors", Technical Report CSL-TR-95-661, Computer Systems Laboratory, Stanford University, Feb. 1995.
[Bowh95] William J. Bowhill, et al., "Circuit Implementation of a 300-MHz 64-bit Second-generation CMOS Alpha CPU", Digital Technical Journal, Special 10th Anniversary Issue, Vol. 7, No. 1, 1995, pp. 100-118.
[Chap91] Terry I. Chappell, Barbara A. Chappell, Stanley B. Schuster, James W. Allen, Stephen P. Klepner, Rajiv V. Joshi, and Robert L. Franch, "A 2-ns Cycle, 3.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture", IEEE Journal of Solid-State Circuits, Vol. 26, No. 11, November 1991, pp. 1577-1585.
[Chen92] Tien-Fu Chen and Jean-Loup Baer, "Reducing Memory Latency via Non-blocking and Prefetching Caches", ASPLOS-V, Boston, Massachusetts, October 12-15, 1992.
[Chen94] Chung-Ho Chen and Arun K. Somani, "A Unified Architectural Tradeoff Methodology", Proceedings of the 21st Annual International Symposium on Computer Architecture, Chicago, Illinois, April 18-21, 1994, pp. 348-357.
[Conte92] Thomas A. Conte, "Tradeoffs in Processor/Memory Interfaces for Superscalar Processors", Proceedings of the 25th Annual International Symposium on Microarchitecture, Portland, Oregon, 1992.
[Cvet94] Zarka Cvetanovic and Dileep Bhandarkar, "Characterization of Alpha AXP Performance Using TP and SPEC Workloads", Proceedings of the 21st Annual International Symposium on Computer Architecture, Chicago, Illinois, April 18-21, 1994, pp. 60-70.
[Edmo95] John H. Edmondson, et al., "Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor", Digital Technical Journal, Special 10th Anniversary Issue, Vol. 7, No. 1, 1995, pp. 119-135.
[Fark94] Keith I. Farkas and Norman P. Jouppi, "Complexity/Performance Tradeoffs with Non-Blocking Loads", Proceedings of the 21st Annual International Symposium on Computer Architecture, Chicago, Illinois, April 18-21, 1994, pp. 211-222.
[Farr94] Matthew Farrens, Gary Tyson, and Andrew R. Pleszkun, "A Study of Single-Chip Processor/Cache Organizations for Large Numbers of Transistors", Proceedings of the 21st Annual International Symposium on Computer Architecture, Chicago, Illinois, April 18-21, 1994, pp. 338-347.
[Gee93] Jeffrey D. Gee, Mark D. Hill, Dionisios N. Pnevmatikatos, and Alan Jay Smith, "Cache Performance of the SPEC92 Benchmark Suite", IEEE Micro, August 1993, pp. 17-27.
[Gray93] Jim Gray, Ed., "The Benchmark Handbook for Database and Transaction Processing Systems", Morgan Kaufmann Publishers, 1993.
[Gwen94] Linley Gwennap, "MIPS R10000 Uses Decoupled Architecture", Microprocessor Report, Volume 8, Number 14, October 24, 1994, pp. 18-22.
[Henn96] John L. Hennessy and David A. Patterson, "Computer Architecture: A Quantitative Approach", Morgan Kaufmann Publishers, Inc., Second Edition, 1996.
[Horo92] M. Horowitz, S. Przybylski, and M. D. Smith, "Tutorial on Recent Trends in Processor Design: Reclimbing the Complexity Curve", Western Institute of Computer Science, Stanford University, 1992.
[Rose95b] Mendel Rosenblum, Stephen A. Herrod, Emmett Witchel, and Anoop Gupta, "Complete Computer System Simulation: The SimOS Approach", IEEE Parallel and Distributed Technology, Volume 3, Number 4, Fall 1995.
[Saul96] Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration", Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 22-24, 1996, pp. 90-101.