Cache Memory Scheme
Proceedings of the 12th IEEE International Conference and Workshops on the Engineering of Computer-Based Systems (ECBS’05)
0-7695-2308-0/05 $20.00 © 2005 IEEE
memory speed is large. Increasing associativity also has the advantage of reducing the probability of thrashing. Repeatedly accessing m different blocks that map into the same set will cause thrashing. A cache with an associativity degree of n can avoid such thrashing if n >= m and an LRU replacement policy is employed [26]. A hash-rehash cache [1] uses two mapping functions and a sequential search to determine the candidate location with an associativity of two, but its non-LRU replacement results in a higher miss rate. Agarwal et al. [2] proposed the column-associative cache, which improves the hash-rehash cache by adding hardware to implement an LRU replacement policy. The predictive sequential associative cache proposed by Calder et al. [5] uses bit selection, a sequential search, and a steering-bit table indexed by predictive sources to determine the search order. However, this approach is based on prediction, which may be incorrect, and has a slightly longer average access time. The skewed-associative cache [17] increases associativity in an orthogonal dimension, using a skewing function instead of bit selection to determine candidate locations. The major drawbacks of this scheme are a longer cycle time and the mapping hardware necessary for skewing.
Ranganathan et al. [16] proposed a configurable cache architecture, useful for media processing, which divides the cache into partitions at the granularity of the conventional cache. Its key drawbacks are that the number and granularity of the partitions are limited by the associativity of the cache, and that the cache hardware must be modified to support dynamic partitioning and associativity. Another configurable cache architecture, intended for specific embedded-system applications, has been proposed in [25]; its mapping function can be configured by software to be direct-mapped, 2-way, or 4-way set-associative.

3. HMAM Organization

In the HMAM organization the cache space is divided into several variable-size associative sets in a hierarchical form. Let C be the cache size in blocks and k be the division factor used for dividing the cache space. In HMAM, the cache space is recursively divided into k different sets; a set with a hierarchy level of h has an associativity of C/k^h blocks. The first k-1 sets have the largest size of C/k blocks, while the last k sets contain a minimum of 1 block.
In this scheme, k is a parameter that adjusts the associativity used in the cache. In other words, if k is 1 then the HMAM cache is a fully-associative cache, and if k is C then the HMAM cache is a direct-mapped cache. Also, the HBAM cache [22] is an HMAM cache with k of 2. For typical values of k (i.e., 2, 4, 8, 16), the cost of this scheme is less than the cost of the fully-associative scheme. Due to the separate logic associated with each set, its operation is relatively fast: faster than fully-associative mapping but slower than set-associative mapping [23, 24]. In this scheme, address translation is performed dynamically (value based) instead of statically (bit selection) as in the direct mapping scheme. This means that there is no predefined format that determines the set number of a memory address in the cache; the set number must be determined from the address pattern coming from the CPU. As an example, Figure 1 shows the set organization of a 16-block HMAM cache with k equal to 4 (4HMAM), for a main memory of 128 blocks. Table 1 portrays the address mapping and the bits required for tag storage in the 4HMAM cache, where the block size is 2^b words. The number of sets in the 4HMAM cache is 3 log4(C) + 1.

[Figure 1. Set organization of a 16-block 4HMAM cache for a 128-block main memory. Hierarchy level 1 consists of Set# 1 (blocks 4m+3), Set# 2 (blocks 4m+2), and Set# 3 (blocks 4m+1).]
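The value-based set determination described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name is ours, and the set ordering (remainder k-1 maps to the first set of each level) is taken from the Figure 1 organization.

```python
def hmam_set(block_addr: int, C: int, k: int) -> tuple:
    """Locate the HMAM set for a memory block address, by value rather
    than by fixed bit selection.

    At each hierarchy level the block address is taken modulo k: a
    non-zero remainder r selects one of that level's k-1 sets (r = k-1
    gives the level's first set, as in Figure 1), while remainder 0
    descends one level. Returns (set number, set associativity)."""
    set_base = 0          # sets already numbered at shallower levels
    size = C // k         # associativity of the current level's sets
    while size > 1:
        r = block_addr % k
        if r != 0:
            return set_base + (k - r), size
        block_addr //= k
        set_base += k - 1
        size //= k
    # deepest level: the last k sets hold a single block each
    r = block_addr % k
    return set_base + (k - r if r != 0 else k), 1
```

For the 16-block 4HMAM cache of Figure 1, block addresses of the form 4m+3 land in Set# 1 with associativity 4, while addresses divisible by 16 fall through to the last singleton set, Set# 7 = 3 log4(16) + 1.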
Table 1. Address mapping and required bits for tag in each set

Set #         | Logical condition in address decoder                        | Tag storage bits                   | Bit length of tag array | Associativity in set
1             | A_{b+1} A_b = 11                                            | A_{n-1} downto A_{b+2}             | n - b - 2               | C/4
2             | A_{b+1} A_b = 10                                            | A_{n-1} downto A_{b+2}             | n - b - 2               | C/4
3             | A_{b+1} A_b = 01                                            | A_{n-1} downto A_{b+2}             | n - b - 2               | C/4
4             | A_{b+3} A_{b+2} A_{b+1} A_b = 1100                          | A_{n-1} downto A_{b+4}             | n - b - 4               | C/16
...           | ...                                                         | ...                                | ...                     | ...
3 log4(C) - 3 | A_{b+2log4(C)-3} A_{b+2log4(C)-4} ... A_{b+1} A_b = 110...0 | A_{n-1} downto A_{b+2(log4(C)-1)}  | n - b - 2(log4(C)-1)    | 4
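The pattern behind Table 1 (each hierarchy level of a 4HMAM cache spends two more address bits on set selection, so the tag shrinks by 2 bits while the associativity shrinks by a factor of 4) can be enumerated programmatically. A minimal Python sketch, not from the paper; the function name and row format are ours:

```python
def table1_rows(C: int, n: int, b: int) -> list:
    """Rows of Table 1 for a 4HMAM cache of C blocks, n-bit block-frame
    addresses, and 2^b-word blocks: (set number, tag length, associativity).
    """
    rows, set_no, level = [], 1, 1
    size = C // 4                     # associativity at hierarchy level 1
    while size > 1:
        for _ in range(3):            # k-1 = 3 sets per hierarchy level
            rows.append((set_no, n - b - 2 * level, size))
            set_no += 1
        size //= 4
        level += 1
    for _ in range(4):                # the last k = 4 sets hold 1 block each
        rows.append((set_no, n - b - 2 * level, 1))
        set_no += 1
    return rows
```

For C = 16 this yields 7 rows (3 log4 C + 1 sets): three sets of associativity 4 with (n - b - 2)-bit tags, then four singleton sets with (n - b - 4)-bit tags.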
needed for upgrading the simulator were: modification of the set-determination function used in the set-associative cache, and development of a new function that finds the number of associative lines and tag bits according to the obtained set. Benchmarks used in this trace-driven simulation included several different kinds of programs from the SPEC2000 benchmark suite [19], namely bzip2, apsi, swim, vortex, eon_cook, eon_rush, gcc, gzip, parser, sixtrack, and vpr. Each file contains at least 10M references. Both data and instruction references are collected and used for the simulation. Three well-known placement schemes, i.e., direct, set-associative, and fully-associative mapping, are chosen for performance comparison and evaluation.
Two major performance metrics, i.e., the miss (hit) ratio and the average memory access time, are used to evaluate and compare the HMAM cache with the other schemes.
The cache miss ratios for the conventional fully-associative (FA), 4-way set-associative (4WSA), and direct-mapped (DC) caches and the proposed HMAM cache with several values of k are shown in Figure 2. For the fully-associative cache, denoted as FA in the figure, the notation "32k-8byte" denotes a 32 KB fully-associative cache with a block size of 8 bytes.
Notice that the average miss ratio of the HMAM cache for a given size (e.g., 32 KB) is very close to that of the FA cache, and that the HMAM cache approaches the 4WSA cache as k grows.
Another useful measure for evaluating the performance of any given memory hierarchy is the average memory access time, which is given by

Average memory access time = Hit time + Miss rate x Miss penalty    (1)

Here the hit time is the time to process a hit in the cache, and the miss penalty is the additional time needed to service a miss. The basic parameters for the simulation are: CPU clock = 200 MHz, memory latency = 15 CPU cycles, memory bandwidth = 1.6 Gb/s. The hit time is assumed to be 1 CPU cycle. These parameters are based on the values for common 32-bit embedded processors (e.g., the ARM920T [3] or the Hitachi SH4 [9]).
The average memory access times for the conventional fully-associative and direct-mapped caches and the various HMAM caches are shown in Figure 3. As a result of the benchmark analysis, applications with a high degree of locality, such as gzip, show a particularly high performance improvement when using the HMAM cache. As shown in these figures, when k is relatively low, HMAM behaves very closely to the fully-associative cache, rather than to the conventional set-associative cache, in both miss ratio and average memory access time.
In the case of simulating the k-way set-associative cache, several values of k have been considered. For brevity, only a selected figure is shown here. Figure 5 shows the plot of hit ratio against block size for a cache size of 32 KB and a typical selected benchmark, bzip2. It illustrates the comparative performance of the conventional placement schemes (fully-associative, set-associative, and direct mapping) and the HMAM scheme. Notice the general trend of the HMAM scheme exhibiting higher hit ratios (except against the fully-associative scheme). It can be seen in this figure that the HMAM scheme outperforms the set-associative and direct mapping schemes for a wide variety of cache configurations.

5. Cost and Power Consumption Analysis

5.1 Hardware Complexity

In order to reduce the latency of tag comparison in fully-associative caches, these memories are constructed using CAM (content-addressable memory) structures.
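Equation (1) is easy to check numerically against the simulation parameters above. A minimal Python sketch; the 2% miss rate below is an illustrative value, not a measured result:

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time (Eq. 1), all quantities in CPU cycles."""
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit and 15-cycle miss penalty (the paper's embedded-processor
# parameters), with an illustrative 2% miss rate:
print(amat(1, 0.02, 15))  # about 1.3 cycles per access
```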
Figure 2. Miss ratio (%) of the fully-associative, several HMAM, and direct-mapped caches for various benchmarks
Figure 3. Average memory access time (in CPU cycles) of the fully-associative, several HMAM, and direct-mapped caches for various benchmarks
Since each CAM cell is designed as a combination of storage and comparison logic, the size of a CAM cell is about double that of a RAM cell [15]. For a fair performance/cost analysis, the performance and cost of various direct-mapped, set-associative, and HMAM caches are evaluated. The metric used to normalize the cost-area analysis is the rbe (register-bit equivalent).
We use the same quantities used in [8][10], where the complexity of a PLA (programmable logic array) circuit is assumed to be 130 rbe, a RAM cell 0.6 rbe, and a CAM cell 1.2 rbe.
The RAM area can be calculated as [15]

RAM = 0.6 [#entries + #L_sense_amplifiers] x [(#data_bits + #status_bits) + W_driver]    (2)

where #entries is the number of rows in the tag array or data array, #L_sense_amplifiers is the length of a bit-line sense amplifier, #data_bits indicates the number of tag bits (or data bits) of one set, #status_bits is the number of status bits of one set, and W_driver is the data width of a driver.
The area of a CAM can be given by [15]

CAM = 0.6 [2 #entries + #L_sense_amplifiers] x [2 #tag_bits + W_driver]    (3)
where #tag_bits is the number of tag bits for one set in the tag array.
The total area can be given by

Area = RAM + CAM + PLA    (4)

The area of the HMAM cache was calculated by assuming that it is composed of several fully-associative caches, each of which has its own specified size and tags. Table 2 shows performance/cost for various cache sizes. According to Table 2, the HMAM cache (8HMAM, 4 KB with 8-byte block size) shows about a 40% area reduction compared to the conventional direct-mapped cache (DC, 8 KB with 8-byte block size) while showing almost equal performance gains. Moreover, higher performance for the HMAM scheme, compared to direct mapping schemes, may be achieved by increasing the cache size.

5.2 Power Consumption

Energy dissipation in CMOS integrated circuits is mainly caused by charging and discharging gate capacitances. The energy dissipated per transition is given by [8]

E_t = 0.5 C_eq V^2    (5)

To obtain the values of the equivalent capacitance, C_eq, of the components in the memory subsystem, we follow the model proposed by Wilton and Jouppi [20, 21]. Their model assumes a 0.8 um process and a supply voltage of 3.3 volts. To obtain the number of transitions that occur at each transistor, the model introduced by Kamble and Ghose [10, 11] is adopted here. According to this model, the main sources of power are the following four components: E_bits, E_word, E_output, and E_input, which denote the energy dissipation in the bit-lines, the word-lines, the address and data output lines, and the address input lines, respectively. The energy consumption is then given by

E_cache = E_bits + E_word + E_output + E_input    (6)

5.2.1. Energy dissipated in the bit lines. E_bits is the energy consumption of all the bit-lines when the SRAMs are accessed; it is due to pre-charging the lines and reading or writing data. It is assumed that the tag and data arrays in the direct-mapped cache can be accessed in parallel. In order to minimize the power overhead introduced in fully-associative caches, a tag look-up is performed first and the data array is then accessed only if a hit occurs. In a K-way set-associative cache, E_bits can be calculated as

E_bits = 0.5 V^2 [N_bp C_bp + K (N_hit + N_miss)(8B + T + S)(C_g,Qpa + C_g,Qpb + C_g,Qp) + N_bw C_ba + N_br C_ba]    (7)

where N_bp, N_bw, and N_br are the total number of transitions in the bit lines due to precharge, the number of writes, and the number of reads, respectively. B is the block size in bytes, T is the tag size of one set, and S denotes the number of status bits per block. C_g,Qpa, C_g,Qpb, and C_g,Qp are the gate capacitances of the transistors Q_pa, Q_pb, and Q_p. Finally, C_bp and C_ba are the effective load capacitances of each bit line during pre-charging and during reading/writing from/to the cell. According to the results reported in [20], we have

C_bp = N_rows [0.5 C_drain,Q1 + C_bitwire]    (8)

C_ba = N_rows [0.5 C_drain,Q1 + C_bitwire] + C_drain,Qp + C_drain,Qpa    (9)

where C_drain,Q1 is the drain capacitance of transistor Q1, and C_bitwire is the bit-wire capacitance of a single bit cell.

5.2.2. Energy dissipated in the word lines. E_word is the energy consumed by the assertion of a particular word-line; once the bit-lines are all precharged, one row is selected, performing the read/write of the desired data. E_word can be calculated as [20]

E_word = V^2 K (N_hit + N_miss)(8B + T + S)[2 C_gate,Q1 + C_wordwire]    (10)

where C_wordwire is the word-wire capacitance of a single bit cell. The capacitance of a whole word line is thus

C_wordline = N_column [2 C_gate,Q1 + C_wordwire]    (11)

where N_column is the number of columns in a row.

5.2.3. Energy dissipated at the data and address output lines. E_output is the energy used to drive the external buses; this component includes the power consumed both for the data sent or returned and for the address sent to the lower-level memory on a miss request. E_output can be calculated as

E_output = E_addr_output + E_data_output    (12)

where E_addr_output and E_data_output are the energy dissipations at the address and data lines, and are given by

E_addr_output = 0.5 V^2 N_addr_output C_addr_out    (13)

E_data_output = 0.5 V^2 N_data_output C_data_out    (14)

where N_addr_output and N_data_output are the total number of transitions at the address and data output lines, respectively, and C_addr_out and C_data_out are their corresponding capacitive loads. The capacitive load for on-chip destinations is 0.5 pF, and for off-chip destinations it is 20 pF [14].
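The bookkeeping of the area model (Eqs. 2-4) and the energy model (Eqs. 5-6) is simple enough to sketch directly. A minimal Python illustration, not the paper's tooling; the function names are ours, while the rbe weights and the 3.3 V supply come from [8][10] and [20]:

```python
def ram_area(entries, l_sense, data_bits, status_bits, w_driver):
    """RAM array area in rbe, Eq. (2); a RAM cell weighs 0.6 rbe."""
    return 0.6 * (entries + l_sense) * ((data_bits + status_bits) + w_driver)

def cam_area(entries, l_sense, tag_bits, w_driver):
    """CAM tag-array area in rbe, Eq. (3); the factors of 2 reflect a
    CAM cell being about double a RAM cell (1.2 rbe)."""
    return 0.6 * (2 * entries + l_sense) * (2 * tag_bits + w_driver)

def total_area(ram, cam, pla=130.0):
    """Total cache area, Eq. (4); the PLA control logic is 130 rbe."""
    return ram + cam + pla

def transition_energy(c_eq, v=3.3):
    """Energy per transition, Eq. (5), at the assumed 3.3 V supply."""
    return 0.5 * c_eq * v * v

def cache_energy(e_bits, e_word, e_output, e_input):
    """Total cache energy, Eq. (6): bit-line, word-line, output-line,
    and input-line components."""
    return e_bits + e_word + e_output + e_input
```

Evaluating an HMAM cache under this model then amounts to summing `cam_area` and `ram_area` over its variable-width tag sections, as done for Table 2.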
5.2.4. Energy dissipated at the address input lines. E_input is the energy dissipated at the input gates of the row decoder. The energy dissipated at the address decoders is not considered, since this value turns out to be negligible compared to the other components [8].
The actual values of the different factors of power dissipation are obtained using the above-mentioned equations and assumptions. Table 3 shows the various capacitance values.

[Table 3. Capacitance values]

To obtain the actual power consumption, the tag array of each section in the proposed cache must be considered as a CAM structure. These values were obtained by considering the proposed cache as a collection of several set-associative caches with different tag widths. Now the power consumption of the proposed cache can be compared with that of the associative caches. Figure 5 presents the power consumption of the fully-associative (FA), HMAM (2HMAM, 4HMAM), 4-way set-associative (4WSA), and direct-mapped (DC) caches with the same cache size. As shown before, the fully-associative cache can achieve a slightly better performance gain than the HMAM cache, but the power consumption aspect has a significant effect: the bit-lines of a large block size and a large number of content swaps cause the high power consumption of the fully-associative cache compared to the HMAM cache. Thus it is shown that the power consumption of the 4HMAM cache is reduced by about 5-22% compared to that of the fully-associative cache configuration. It should be noted that the power consumption of the HMAM cache depends highly on its division parameter, k; as this parameter grows, the power consumption approaches that of the 2WSA and DC caches.

6. Conclusions

We have presented a generalized version of the HBAM cache, similar to the set-associative mapping scheme, called Hierarchical Multiple Associative Mapping (HMAM). The simple version of this approach, with k equal to two, is the HBAM cache. Results obtained using a trace-driven simulator for different scenarios reveal that HMAM can provide significant performance improvements with respect to traditional schemes. The cost and power consumption of HMAM are less than those of the fully-associative scheme, and the division parameter k can adjust the power and cost of the HMAM cache close to those of the conventional set-associative and direct-mapped caches.

7. References

[1] Agarwal A., Hennessy J., Horowitz M., "Cache Performance of Operating Systems and Multiprogramming," ACM Trans. Computer Systems, Vol. 6, No. 4, 1988, pp. 393-431.
[2] Agarwal A., Pudar S. D., "Column-Associative Caches: a Technique for Reducing the Miss Rate of Direct-Mapped Caches," Int'l Symp. on Computer Architecture, 1993, pp. 179-190.
[3] ARM Company, "ARM920T Technical Reference Manual," http://www.arm.com
[4] Brigham Young University, "BYU Cache Simulator," http://tds.cs.byu.edu
[5] Calder B., Grunwald D., "Predictive Sequential Associative Cache," Proc. 2nd Int'l Symp. High Performance Computer Architecture, 1996, pp. 244-253.
[6] Chen H., Chiang J., "Design of an Adjustable-way Set-Associative Cache," Proc. Pacific Rim Communications, Computers and Signal Processing, 2001, pp. 315-318.
[7] Khalid H., "KORA-2 Cache Replacement Scheme," Proc. 6th IEEE Int'l Conf. on Electronics, Circuits and Systems (ICECS'99), Vol. 1, 1999, pp. 17-21.
[8] Hennessy J. L., Patterson D. A., “Computer architecture
Quantitative Approach,” 2nd Edition, Morgan-Kaufmann
Publishing Co., 1996.
[9] Hitachi Company: SH4 Embedded Processor.
http://www.hitachi.com
[10] Kamble M. B., Ghose K., “Analytical Energy Dissipation
Models for Low Power Caches,” Proc. of Intl. Symp. on
Low Power Electronics and Design, 1997, pp. 143-148.
[11] Kamble M. B., Ghose K., ”Energy-Efficiency of VLSI
Cache: A Comparative Study,” Proc. IEEE 10th Int’l.
Conf. on VLSI Design, 1997, pp. 261-267.
[12] Kessler R. R., et al., “Inexpensive Implementations of
Associativity,” Proc. Intl. Symp. Computer Architecture,
1989, pp. 131-139.
[13] Kim S., Somani A., “Area Efficient Architectures for
Information Integrity Checking in the Cache Memories,”
Proc. Intl. Symp. Computer Architecture, 1999, pp. 246-
256.
[14] Lee J. H., Lee J. S., Kim S. D., “A New Cache
Architecture based on Temporal and Spatial Locality,”
Journal of Systems Architecture, Vol. 46, 2000, pp. 1452-
1467.
[15] Mulder J. M., Quach N. T., Flynn M. J., ”An Area Model
for On-Chip Memories and its Applications,” IEEE
journal of solid state Circuits, Vol. 26, 1991, pp. 98-106.
[16] Ranganathan P., Adve S., Jouppi N. P. “Reconfigurable
Caches and their Application to Media Processing,” Proc.
Int. Symp. Computer Architecture, 2000, pp. 214-224.
[17] Seznec A., ”A Case for Two-Way Skewed-Associative
Caches,” Proc. Intl. Symp. Computer Architecture, 1993,
pp. 169-178.
[18] Smith A. J. “Cache memories. Computing Survey,” Vol.
14, No. 4, 1982, pp. 473-530.
[19] Standard Performance Evaluation Corporation, SPEC
CPU 2000 benchmarks.
http://www.specbench.org/osg/cpu2000
[20] Wilton S. J. E., Jouppi N. P., ”An Enhanced Access and
Cycle Time Model for On-chip Caches,” Digital WRL
Research Report 93/5, 1994.
[21] Wilton S. J. E., Jouppi N. P., “CACTI: An Enhancement
Cache Access and Cycle Time Model,” IEEE Journal of
Solid-State Circuits, Vol. 31, 1996, pp. 677-688.
[22] Zarandi H., Sarbzai-Azad H., “Hierarchical Binary Set
Partitioning in Cache Memories,” to appear in The Journal
of Supercomputing, Kluwer Academic Publisher, 2004.
[23] Zarandi H., Miremadi S. G., Sarbazi-Azad H., “Fault
Detection Enhancement in Cache Memories Using a High
Performance Placement Algorithm,” IEEE International
On-Line Testing Symposium (IOLTS), 2004, pp. 101-106.
[24] Zarandi H., Miremadi S. G., “A Highly Fault Detectable
Cache Architecture for Dependable Computing,” to
appear in 23rd International Conference on Safety,
Reliability and Security (SAFECOMP), 2004, Germany.
[25] Zhang C., Vahid F., Najjar W., "A Highly Configurable Cache Architecture for Embedded Systems," Int. Symp. on Computer Architecture, 2003, pp. 136-146.
[26] Zhang C., Zhang X., Yan Y., "Two Fast and High-Associativity Cache Schemes," IEEE Micro, 1997, pp. 40-49.