Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Memory Cache (Finley 2000)

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

19. 01. 2023.

07:00 Cache

Cache
Thomas Finley, May 2000

Contents and Introduction


Contents and Introduction
Overview
Direct Mapped Cache
Working in the Fields
Example
Practice Problems
N-Way Set Associative Cache
Sizes of Fields for N-Way Set Associative
2-Way Set Associative Cache
3-Way Set Associative Cache
Set Associative vs. Direct Mapped
Fully Associative Cache
Unifying Theory
Cache Design and Other Details
Line Size
Types of Misses
Writing to Memory
Sub-Blocks
Cache Aware Programming

The purpose of this document is to help people have a more complete understanding of what
memory cache is and how it works. I discuss the implementation and comparative advantages of
direct mapped cache, N-way set associative cache, and fully-associative cache. Also included are
details about cache design. I try to give a complete treatment of the more important aspects of
cache.

This is a supplement for the notes and text, not a replacement. Although this text can be read on its
own without reference to the book or the notes, this is more text heavy, while the notes provide
more valuable visual aides and a far more concise overview than I provide here. Also, there are
many aspects of cache described in the notes that I do not describe in this version of the text, like
sub-blocks.

The reason you should care is because a programmer can design software to take advantage of
cache. If a programmer is aware of cache and its trappings, it is a simple matter to arrange
memory accesses to take advantage of cache and have faster run times. A great proprotion of
computing is accessing memory, and anything that can be done to speed memory accesses can
only be a good thing. Because of this, I consider cache to be one of the more important topics of
computer hardware.

Because cache is such a multifaceted subject, and because learning about cache requires the
introduction of a lot of new ideas, this document is necessarily long. As with all of these notes, I

https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 1/15
19. 01. 2023. 07:00 Cache

tried to make each section portable, so that if you only need to firm up your understanding on one
subject, you may skip to the relevant section.

Overview
The purpose of cache is to speed up memory accesses by storing recently used chunks of
memory. Now, main memory (RAM) is nice and cheap. Not as cheap as disk memory, but cheap
nevertheless. The disadvantage of this sort of memory is that it is a bit slow. An access from main
memory into a register may take 20-50 clock cycles. However, there are other sorts of memory that
are very fast, that can be accessed within a single clock cycle. This fast memory is what we call
cache.

However, this fast memory is prohibitively expensive, so we can't compose our main memory
entirely of it. However, we can have a small amount, say 8K to 128K on average, in which we store
some of our data. But what data do we choose? The way it happens, we choose data that has
been most recently used. This is a policy that takes advantage of something called spatial and
temporal locality.

Spatial and temporal locality of memory accesses refers to the idea that if you access a certain
chunk of memory, in the near future you're likely to access that memory, or memory near it, again.
For example, you execute consecutive instructions in memory, so the next dozen of so instructions
that you access are probably contained in a given contiguous block of data, so the next instruction
would be accessed close by. Similarly, if you're accessing elements of an array, they will be located
consecutively, so it makes sense to store a block of data that has recently been used. Separate
variables called in a function are usually close by as well. That is what cache does.

When we want to access memory at a certain address, we look in the cache to see if it is there. If it
is there, we get it from cache instead of going all the way to memory; that situation is called a
"cache hit." If the data is not in cache, then we bring the data from memory to cache, which takes
longer since the data must be brought in from the smaller memory; that situation is called a "cache
miss," and that delay that we endure because that memory must be brought from main memory is
called a "miss penalty."

Direct Mapped Cache


Direct mapped cache works like this. Picture cache as an array with elements. These elements are
called "cache blocks." Each cache block holds a "valid bit" that tells us if anything is contained in
the line of this cache block or if the cache block has not yet had any memory put into it. There is
also a block of data called a "line" which stores a recently used block of data. Perhaps most
importantly, there is a number called a "tag" composed of more significant bits of the memory
address from which the line's data came from.

more cache blocks up here... You can think about the direct mapped cache
this way. Each row in the table to the left
V T L N represents a cache block. We have our valid
bit which tells us if this cache block currently
      N+1 holds data. Next is our tag, a number that tell
us where in memory this bit is from. After that,
      N+2
we have our line, which is the data that we
      N+3 have stored in cache.
      N+4 The number to the right is just the cache index.
Cache blocks are indexed like elements in an
more cache blocks down here... array, and there is a cache index associated
with each cache block. As you see, I show
blocks N through N+4, where N is some integer

https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 2/15
19. 01. 2023. 07:00 Cache

such that these are valid cache blocks for this


particular cache.

As a nota bene, you must know that for dirct mapped cache, cache sizes and line sizes (and thus
the number of cache blocks) are made in powers of two. It works out better for the hardware
design if this is the case. Therefore, we're not going to see direct mapped cache's with 15
kilobytes, or 700 bytes, or what have you. We're working only in powers of two for direct mapped
cache.

For a direct mapped cache, suppose we have a cache size of 2M bytes with a line size of 2L bytes.
That means that there are 2M-L cache blocks. Now, we use the address of the data we're trying to
access to determine which of these cache blocks we should look at to determine if the data we
want is already in memory.

Working in the Fields

32-M bits M-L bits L bits


Tag Index Byte Select

We use the address in this way. Certain bits in our 32 bit memory address have special
significance. We group the bits into three distinct groups we call fields. In this discussion, M and L
are defined just as they were a paragraph ago.

The rightmost L bits of the address is the byte select field. If data is found in a cache block, since
addresses are byte addressed, we use this field to determine which of the bytes in a line we're
trying to access.

The M-L bits just to the left of the byte select field is called the index field. As we said before, we
have 2M-L cache blocks in our cache. These cache blocks are ordered, just as in an array. There is
a first one (with index 0) and a last one (with index M-L-1). If we're given an address, the value
represented by the bits of the index field tell us which of these cache blocks we should examine.

The leftmost 32-M bits of the address, just to the left of the index field, is called the tag field. This
field is put into the tag part of the cache block, and that uniquely identifies from where in main
memory the data in the line of this cache block came from. If we're looking at a cache block, we
presumably know what index of this cache is, and therefore in conjunction with the cache tag we
can compose the addresses from which the data in the line came from. When we store a line of
data into a cache block, we store the value represented by the tag field into the tag part of the
cache block.

As a side note, it is not written in stone that the index field must be composed of the bits
immediately to the left of the byte select field. However, the way that memory accesses tend to
work, it happens that if the index field is in the bits immediately to the left of the byte select field,
when we look for data, it is in cache rather than in main memory a larger proportion of the time.
While there might be some special cases in which having the index field elsewhere would yield a
higher hit ratio, in general it is best to have the index field immediately to the left of the byte select
field.

When we're checking cache to determine if the memory at the address we're looking for is actually
in cache, we take the address, and extract the index field. We find the corresponding cache block.
We then check to see if this cache block is valid. If it's not valid, then nothing is stored in this cache
block, so we obviously must reference main memory. However, if the valid bit is set, then we
compare the tag in the cache block to the tag field of our address. If the tags are the same, then
we know that data we're looking for and the data in the cache block are the same. Therefore, we
may access this data from cache. At this point we take the value in the byte select field of the

https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 3/15
19. 01. 2023. 07:00 Cache

address. Suppose this value is N. We are then looking for byte N in this line (hence the name byte
select).

Example
However, these things are always more clear when we talk about a specific example. Let's talk
about my first computer, a Macintosh IIcx. Suppose we have a direct mapped cache whose cache
lines are 32 bytes (25 bytes) in length, and suppose that the cache size is 512 bytes (29 bytes).
Since there are 512 bytes in the cache, and 32 bytes per cache block, that means that there are
512/32 = 16 cache blocks.

In reality, a 512 byte cache is a very small cache; you'd never find a cache that small in today's
computers. However, it makes the example that follows managable.

Now, suppose we have addresses that are thirty-two bits in length. In this example, taking M and L
to have the same definitions as they did at the beginning of our discussion on direct mapped
cache, M = 9 and L = 5. So, our byte select field is the rightmost 5 bits. Our index field is the next
M-L = 4 bits. Our tag field is the leftmost 32-M = 23 bits.

This actually makes sense. Since we have 16 cache blocks, we can uniquely represent the 16
possibilities of cache blocks we may address by 4 bits, which is the size of our index field.
Similarly, we have 32 bytes that we can access in the data potion of our cache block since line size
is 32 bytes, and we may uniquely address these 32 bytes with 5 bits, and indeed that is the size of
our byte select field.

To illustrate the working of this, I'll go step by step through a series of memory accesses. In my
table represenation of cache I omitted the cache line to save space, because it really doesn't
matter what the line holds. The data in cache is the data in cache, and that's all there is to it. We're
interested in what memory is accessed where, because that is what affects the workings of the
cache. That changes the state of the valid bits and the tag numbers of our cache blocks.

For this example, I provide a series of tables, which represents the cache blocks, and their
corresponding valid bits and tags. The first number in a row in the table is the index of this
particular cache block. The next number is the valid bit. The rightmost number is the tag, which
tells us where the data in the line (which I do not show) is from.

Initially

The table for the initial conditions shows that the valid bits are all set to zero, because nothing has
yet been stored in cache.

Step 1

For step one, suppose we access the memory at address 0x0023AF7C. The looking at that in
binary, that is 000000000010001110101111011111002. If we separate these bits into the lengths of
the fields we have determined with dashes between each field, that is more easily read as
00000000001000111010111-1011-11100. So, our index is 10112=1110. We look at index 11. Is
there anything in there yet? We see there is not. Therefore, we load the data contained at memory
addresses 0x0023AF60 through 0x0023AF7F into the 32 byte line of the cache block with index
11.

Note that the memory addresses between those two addresses are the memory addresses that
would have the same tag and index fields if we broke the addresses into their respective fields.

Now that the data is loaded into the line, we set the tag for this cache block to be the same as the
tag of our address, since this cache block will now hold a line of data from the addresses with the
same index and tag. This tag is 10001110101112=456710. We also must set our valid bit to true
https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 4/15
19. 01. 2023. 07:00 Cache

since this cache block now holds a block of data. These changes are reflected in the table for step
1.

After these changes and the memory loads, we know that the data we want is in cache block 11.
Our byte select field is 111002 = 2810, so we get the 29th byte of our line.

Since 0 would address the first byte, 28 addresses the 29th byte. It's exactly as though the cache
line is an array of chars (a byte length quantity), and the byte select is the index of the particular
element of that array that we're looking at right now.

Now, this is a cache miss, but there are different sorts of cache misses. Notice that no data has
been loaded into cache yet. That is because no memory accesses have taken place. Because
we're just starting out, this is called a "compulsory miss." We cannot help but not have data in this
cache block yet, because we just started.

Step 2

After step 1, suppose we have another memory access, this time at memory address 0x0023AE4F.
Ignoring the intervening binary operations, this works out to have a byte select field of 15, an index
field of 2, and a tag field of 4567.

If we look at cache block 2, we see that it is not valid. Similar to last time, we load the 32 bytes of
memory from addresses 0x0023AE40 to 0x0023AE5F into the line of this cache block, change the
tag to 4567, and set the valid bit to true.

The tag field is the same as in the previous operation; this was intentional on my part. As we see,
the fact that the tag is the same here as it was last time is of no significance, since the two are part
of completely different indexed cache blocks; they are unrelated insofar as cache is concerned.

Similar to last time, the data we want is in the cache line. Our byte select field is 15, so we access
the 16th byte (since 0 would address the first byte) of the line in cache block 2.

Even though we've already had a memory access in step two, this miss is also called a compulsory
miss because the cache block that we're loading into hasn't yet had any data loaded into it.

Step 3

Suppose now we have a memory access at address 0x148FE054 (in binary,


00010100100011111110000 0010 10100). This works out to have a byte select field of
101002=2010, index field of 00102=210, and a tag field of 000101001000111111100002=67377610.

We look at cache block 2. We see that it is valid. We then compare the tags. 673776 is not equal to
4567, which is the tag of cache block 2. This means that, even though there is data in this cache
block, the data we want is not in here. We must load the data, and we do so just as if this block
had not been valid.

We load the data at addresses 0x147FE040 to 0x147FE05F into the line at cache block 2. We set
the tag to 673776. We don't need to set the valid bit to one since it already is one. Now this cache
block holds a new portion of data.

As you see, we had two pieces of data that wanted to be in the same cache block: there was the
data we loaded in the previous step, and the data we just loaded in this step. Now, this is not a
conflict miss. What would be a conflict miss is if we tried to load the data that used to be in the
cache (the data that we just replaced) back into cache. When this sort of miss occurs, that is called
a "conflict miss," or rather a "collision."

https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 5/15
19. 01. 2023. 07:00 Cache

We also see one of the primary weaknesses of the direct mapped cache. At the end of step two we
had two blocks used up, which means that we're using 64 bytes of our 512 byte cache. However,
in this step, step three, because of a conflict miss we overwrote one of those blocks even though
we had 448 bytes unused in our cache. Though this example is completely contrived, this sort of
cache inefficiency is a common problem with direct mapped cache.

Once we load the data into this cache and set the tag field appropriately, we can find the byte that
we want using the byte select field, at index 2 (the 3rd byte).

Step 4

Now suppose we try to access memory address 0x0023AF64. This works out to have byte select
field of 001002=410, index field of 10112=1110, and tag field of 10001110101112=456710.

We look at cache block 11, and it is valid, so we compare the tags. The tags are the same, it turns
out, so we finally have a cache hit! The data that we want is in this cache block. Since our byte
select field is 4, we access the fifth byte of this block's cache line. We don't modify cache at all;
tags and valid bits remain the same, and we needn't pull any data from main main memory since
the memory we want has already been loaded into cache.

This block was loaded with data in step one. The memory access for step one was 0x0023AF7C.
Now, this step's memory access, 0x0023AF64, is obviously a different address than that, but
0x0023AF64 and 0x0023AF7C are pretty close. Since in the memory access for 0x0023AF7C we
loaded that cache block's line with memory from 0x0023AF60 to 0x0023AF7F (32 bytes in all), the
data for 0x0023AF64 could be found there. So, two memory accesses may access the same data
in cache even though they are not exactly the same.

This set of tables trace the 16 cache blocks of the cache example through each
of the memory accesses described above. Each "step" table reflects the state
of cache at the end of each step.
Initially Step 1 Step 2 Step 3 Step 4
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 1 0 0 1 0 0 1 0 0
2 0 0 2 0 0 2 1 4567 2 1 673776 2 1 673776
3 0 0 3 0 0 3 0 0 3 0 0 3 0 0
4 0 0 4 0 0 4 0 0 4 0 0 4 0 0
5 0 0 5 0 0 5 0 0 5 0 0 5 0 0
6 0 0 6 0 0 6 0 0 6 0 0 6 0 0
7 0 0 7 0 0 7 0 0 7 0 0 7 0 0
8 0 0 8 0 0 8 0 0 8 0 0 8 0 0
9 0 0 9 0 0 9 0 0 9 0 0 9 0 0
10 0 0 10 0 0 10 0 0 10 0 0 10 0 0
11 0 0 11 1 4567 11 1 4567 11 1 4567 11 1 4567
12 0 0 12 0 0 12 0 0 12 0 0 12 0 0
13 0 0 13 0 0 13 0 0 13 0 0 13 0 0
14 0 0 14 0 0 14 0 0 14 0 0 14 0 0
15 0 0 15 0 0 15 0 0 15 0 0 15 0 0

https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 6/15
19. 01. 2023. 07:00 Cache

Wow. That just kept going, didn't it?

Practice Problems
These are just a couple of questions that demonstrate the most basic concepts of direct mapped
cache. The answers are provided after all the questions. These aren't terribly difficult, but they
should get you thinking about cache. All of these questions pertain to direct mapped cache, in a
computer with 32-bit addressing.

Questions

1. If we have a cache size of 4KB with a 16 byte line size, what will be the sizes of the three
fields (tag, line, and byte select) which we divide our address into?
2. Suppose our fields are arranged as described above (first tag field, then index field, then byte
select field) in a cache of 16KB with 32 byte line sizes. What would be the values of each of
the three fields for the following addresses?
a. 0x00B248AC
b. 0x5002AEF3
c. 0x10203000
d. 0x0023AF7C

Answers

1. Our cache size is 4KB = 212 bytes, with a line size of 16 bytes = 24 bytes. Therefore, our byte
select field is 4 bits, our index field is 12-4 = 8 bits, which leaves 20 bits for our tag field.
2. This is actually two problems in one. We must first determine the sizes of each of the fields.
Our cache size is 16KB = 214 bytes, with a line size of 32 bytes = 25 bytes. So, our byte
select field is 5 bits, our index field is 14-5 = 9 bits, which leaves 18 bits for our tag field.
a. For 0x00B248AC, tag is 0x2C9, index is 0x45, and byte select is 0xC.
b. For 0x5002AEF3, tag is 0x1400A, index is 0x177, and byte select is 0x13.
c. For 0x10203000, tag is 0x4080, index is 0x180, and byte select is 0x0.
d. For 0x0023AF7C, tag is 0x8E, index is 0x17B, and byte select is 0x1C.

N-Way Set Associative Cache


Another sort of cache is the N-way set associative cache. This sort of cache is similar to direct
mapped cache, in that we use the address to map the address to a certain cache block. The
important difference is that instead of mapping to a single cache block, an address will map to
several cache blocks. For example, in a 2-way set associative cache, it will map to two cache
blocks. In a 5-way set associative cache, it will map to five cache blocks.

In this cache there may be several cache blocks per index. This group of cache blocks is referred
to collectively as an "index set." In our direct mapped cache, there was one cache block per index
set. In an N-way set associative cache, there are N cache blocks per index set.

So, we have an address, and like in direct mapped cache each address maps to a certain index
set. Once the index set is found, the tags of all the cache blocks in this index set are checked in
parallel. If one of those cache blocks holds the data that we're looking for, then the data for that
one block is extracted.

https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 7/15
19. 01. 2023. 07:00 Cache

If one of those blocks does not hold the data we want, then like before we have a cache miss, and
the data must be loaded. If one of the cache blocks in this index sets is free, then we can simply
load it in there. If, however, all of the blocks are already taken, then one of those blocks must be
overridden. We choose to free up the one that has been least recently used, and write the new
data in that block. This is a policy called Least Recently Used (LRU) replacement.

There are any number of ways (some better than others) in implementing a system that traces the
order in which cache blocks of an index set have been used, so that we may appropriately choose
which cache block to replace in if we must replace a block. It is not very important to know exactly
how this is done, but you should think about ways of implementating this feature.

Sizes of Fields for N-Way Set Associative


Like direct mapped cache, when memory accesses occur the address is broken into three fields:
tag, index, and byte select. The sizes of fields for N-Way Set Associative are very similar to those
of direct mapped cache.

For an N-way set associative cache, suppose we have a cache size of N·2M bytes with a line size
of 2L bytes. That means that there are N·2M-L cache blocks, and at N cache blocks per index set,
there are 2M-L index sets.

For example, suppose we have a three-way set associative cache of size 12KB, with line size of 16
bytes. Our line size may be expressed as 24 bytes, so L=4. Since N=3, and 12KB = 3·4KB = 3·212
bytes, M=12. So, in this example, our byte select field would be 4 bits long, our index field would
be 12-4=8 bits long, and our tag field would be 32-12=20 bits long.

Example : 2-Way Set Associative Cache

Suppose we have a series of memory accesses. For the sake of the example, suppose that the
memory accesses all map to the same index set, but that the addresses' tags differ. Which index is
unimportant, since they're all to the same index set. What is important is the tag of each memory
access.

For each step I provide a table that shows the tags of the single index set that we're considering.
The cell on the right is that structure which tells us the order in which things were last accessed. I
have it initially set so that the system will think that the first block was the least recently used block,
so the first block (leftmost) will be the one first used. The table is shown at the end of a step to
show what the index set looks like at that point.

Block 1 Block 2
Second, then first.
No tag No tag

Now, suppose we have a series of these memory accesses to the same index set as described,
with the following tags for each memory access: 0x00010, 0x8A02F, 0x00010, 0xF1EEE.

For the first access (tag 0x00010), we're assuming that this index set is empty. So, we load the
data with tag 0x00010 into the first of our two cache blocks.

Block 1 Block 2
First, then second.
0x00010 No tag

It is important to remember that in N-way set associative caches there is a structure for each index
set that tells us the order in which the index set's cache blocks were last accessed. For our two-
way cache, the possible states for the order of access are "first, then second" or "second, then
first." For a three-way cache, the possible states are "first, then second, then third," "first, then
https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 8/15
19. 01. 2023. 07:00 Cache

third, then second," etc. (There are six states in all for a three-way cache.) This tells our hardware
just which block to replace if we must replace a block, so that our more recently used blocks are
not disposed of before their time.

For the second access (tag 0x8A02F), we compare the tags. 0x00010 does not equal 0x8A02F,
but we have the second cache block unused (that is, invalid), so we stick the data with tag
0x8A02F into the second of the two cache blocks. At this point we have another structure in this
index set that lets the hardware know that the second cache block has been used most recently.

Block 1 Block 2
Second, then first.
0x00010 0x8A02F

For the third access (tag 0x00010), we compare the tags. The tags for the first of our two cache
blocks match. Our tags don't change and we don't load any more data, but we change our
structure that describes the order in which the blocks were last accessed to reflect that the first
block was more recently accessed.
Block 1 Block 2
First, then second.
0x00010 0x8A02F

For the fourth and last access (tag 0xF1EEE), we compare the tags. There is no match, and there
are no blocks free for this index set. So, what happens? We replace the block that has been least
recently used. We have stored the information that, for this index set, the first cache block was
more recently used. Therefore, we replace the second block. We keep the data with tag 0x00010,
and discard the data with tag 0x8A02F and load the data with tag 0xF1EEE in its place into the
second block. Now we change the order of access to reflect that the second block was more
recently accessed than the first block.

Block 1 Block 2
Second, then first.
0x00010 0xF1EEE

Example : 3-Way Set Associative Cache


I'm not going to go tremendously in depth here. I'm just going to show a 3-way cache, with each
block's tag shown (or the text "no tag" if the block is invalid). After each index set, there will be a
cell that tells us the order of "last accessed" for this index set.

Notice that we have this "order of access" field for each index set. Also notice that it doesn't suffice
to say "the third was the block least recently used," because if we were to access that third block,
then we have no way of telling which block was the one least recently used.

While we can have up to three memory locations that map to the same index in cache at the same
time, we can't have four. We therefore run into problems similar to those discussed in direct
mapped cache (conflict cache misses), even though we can store more cache blocks per index
set. The problem of conflict misses has been alleviated, but not solved.

A simplified view of a 3-way set associative cache with 16


index sets.
Index Block 1 Block 2 Block 3
Third, then first,
0 0x49206 0xC6F76 0x65204
then second.
First, then
1 0x36172 0x6F6C2 No tag second, then
third.
https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 9/15
19. 01. 2023. 07:00 Cache

Second, then
2 0x04272 0x61756 0xE2C20
first, then third.
Third, then
3 No tag No tag No tag
second, then first.
First, then
4 0x62757 0x42073 0x68652 second, then
third.
Second, then
5 0x0646F 0x65732 0x06E6F
third, then first.
Second, then
6 0x74206 0xC6F76 No tag
first, then third.
First, then
7 0x65206 No tag No tag second, then
third.
Third, then first,
8 0xD652E 0x0B3F9 0x636FC
then second.
Second, then
9 0x4D67A 0x7AB7E No tag
first, then third.
First, then
10 0x78A7F No tag No tag second, then
third.
First, then third,
11 0x22448 0x37CAD 0x3DA1E
then second.
Second, then
12 0x33C1A 0x14E66 0x5D56C
third, then first.
First, then third,
13 0x3414D 0x2585F 0x202E1
then second.
Second, then
14 0x295D0 0x0BC9F No tag
first, then third.
Third, then
15 0x632E0 0x255C9 0x82823
second, then first.

Set Associative vs. Direct Mapped


When we're dealing with an N-way set associative cache, when we're looking at a particular index
set we have the added burden of comapring multiple tags to the tag of our address to see if there
is a match. This is done in parallel, so we must have as many comparators as there are blocks in
an index set to check the tags of the blocks to the address tag. We also have to deal with the
access histories of each index set, another thing we didn't have to do with direct mapped cache.
So, N-way set associative cache is considerably more difficult to design and to produce, and is
therefore more expensive. For the same money, an N-way set associative cache would be smaller
than a direct mapped cache, though the advantages of set associative with regard to conflict
misses and hit ratios often justifies the trade.

Fully Associative Cache


Fully associative cache is perhaps the easiest cache to understand and explain because it acts
intuitively as one would expect a cache to work. We load cache blocks similar to before, and when
we fill up the cache and have to replace a cache block, we dispose of the block that has been least
https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 10/15
19. 01. 2023. 07:00 Cache

recently accessed. It's exactly as though we take the N-way set associative cache, and set N such
that there was only one index set. So, we're able to ignore all the garbage about the index field. We
only have a tag field, and a byte select field.

This sounds good as an idea, since fully associative cache acts exactly as a cache "should"; we
don't run into the problem where a more recently used block is discarded and a less recently used
block is kept simply because the index set where that recently used block happened to be stored
was more busy. Clearly fully associative cache has its attractive aspect.

However, from a hardware designer's point of view fully associative cache is extraordinarily
complex. First of all, when you are checking the cache to see if a tag is there, you must check all
the tags of the cache in parallel. That means that there must be as many comparators as there are
cache blocks in the entire cache. In direct mapped cache we only needed one comparator, and in
N-way set associative cache we needed N comparators to check the tags of cache blocks in
parallel.

Also, we must deal with the fact that we must trace the order in which blocks were accessed.
When we were dealing with two or three blocks per index set, this wasn't a huge problem.
However, as an example, if we have to trace the access history all the blocks of the entire cache,
supposing we had C cache blocks to keep track of, then we have C! possible states for access
histories. As a concrete example, if we have something as small as a 1K cache with 32 byte line
sizes, that's (1024/32)! = 32! = 2.6313·1036 combinations for access histories. For our CPS120
graduates, those states may be represented in 118 bits (possibly less after optimizations, but don't
expect me to figure that out). The point is, no matter how we implement this cache, the circuitry for
handling these combinations of access histories (especially considering that the circuitry would
have to do its job in at most one or two clock cycles) would be extraordinary for larger sized
caches.

Because there is so much hardware involved, fully associative caches are necessarily very
expensive. When fully associative caches are even used its cache size is necessarily tiny
compared to the direct mapped and N-way set associative caches. However, for very small caches
with very few blocks fully associative is often a logical choice.

Unifying Theory
If this all seems too complex, the important thing to realize about all these cache types is that they
are all really the same cache type: N-way set associative.

The direct mapped cache is actually the special case of one-way set associative. For each "index
set" there is only one cache block. You may notice that there is no structure in direct mapped
cache that keeps track of the access history of the blocks of each index set, but that is because
there is only one possibility; that the one block we have was accessed least recently (or most
recently depending on how you like to look at it).

Fully associative cache is equivalent to the special case of N-way set associative cache where N is
chosen such that N equals the total number of blocks in the cache, i.e., such that there is only one
index set. It is true that N-way set associative cache has an index field and fully associative cache
does not, but in this case the index field works out to have a length of 0 bits, i.e. there is no index
field. This is consistent with our discussion of fully associative cache.

Cache Design and Other Details


Line Size

In our examples, we often discuss line sizes (alternatively called block sizes) of 32 bytes, or 16
bytes, or what have you. But there is no real reason why you can't have line sizes of 8 bytes, or
https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 11/15
19. 01. 2023. 07:00 Cache

maybe 128 bytes, or 512 bytes.

There is some appeal to having a larger line size, especially when dealing with many memory
accesses with sequential addresses. Suppose, for example, you're printing a string of many
characters, or accessing elements of a large array of integers for computation, or executing a large
number of sequential instructions.

If we're accessing integers in an array sequentially for computation, what we're really doing is
accessing four byte quantities at a time. If our line size is 32 bytes, when we first load an integer
we have a cache miss, and that line is loaded into memory. The next seven accesses to the array
will access that line as the integers are read sequentially, so those will be cache hits. Then an
integer is read that's just beyond the line we just loaded, so we must load the next line. This is a
cache miss, but again the next seven will be hits since the integers will be in that line we just
loaded. This trend continues. One miss, seven hits. One miss, seven hits. Et cetera, et cetera. All
in all, we have a miss rate of one eighth. However, if our line size is 64 bytes, we'll have one miss,
fifteen hits, one miss, fifteen hits, a miss rate of one sixteenth. Larger line sizes takes advantage of
spatial locality.

Larger line sizes aren't all good, however. The miss penalty, or the time we are delayed when we
have a cache miss, is increased since when we bring the data into the cache we must copy more
data, which takes more time.

The other thing is that as we increase our line size, the number of our blocks will decrease. While
this needn't be extremely bad, examine the extreme case where line size equals the cache size, so
there is only one cache block in our entire cache. Suppose that our cache size is 8K, so our block
size is 8K. So long as we access memory in the same contiguous 8K of space, we'll be great, but if
the program ever strays from that memory space we'll necessarily have a miss. If we have a
program that jumps around between two distinct memory locations, accessing one area, then
accessing the other, back and forth, we'll necessarily have a miss every time, whereas if we had
two cache blocks, after the data is initially loaded we'll have a hit every time.

The point is that line sizes are best taken in moderation.

Types of Misses
A cache miss occurs when the data we're looking for is not in cache, and so we must load it from
main memory. In the example for direct mapped cache, we alluded to the fact that there are
different types of cache misses. We mentioned two, but there are four, and how a system designer
goes about minimizing misses depends greatly on the type of misses involved.

In this explanation I provide an analogous situation. Suppose we're writing a paper in Perkins. We
can take books (data) from stacks (main memory) and put them on our desk for easy reference
(cache).

Compulsory Miss

A compulsory miss occurs at the first access of that memory location, so memory location has not
yet been loaded into cache. These are also called cold start misses, or first reference misses.
There isn't much that can be done about this problem. However, since a compulsory miss only
occurs when data is first loaded, in the long run with a program with many memory accesses,
compulsory misses are not a problem.

This situation is analogous to when you're just starting your research in Perkins. You desk is empty,
so you must first get some of the books that you initially want. You can't help this, since the books
won't be at the desk waiting for you when you arrive.

Conflict Miss (Collision)

https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 12/15
19. 01. 2023. 07:00 Cache

A conflict miss occurs when there are many memory accesses mapped to the same index set in a
cache, so data in a block of in this index set may be discarded even though another block in
another relatively unused index set might be even older. If this said block is discarded and later
retrieved, that is called a conflict miss.

This doesn't have a clean analogy to my Perkin's library situation, but suppose that there is a policy
in the library that you may only have a certain number of books of the same color at any one time.
For example, you may have a maximum of three blue books, and three red books, and three
yellow books, and three fuchsia books. You may have space on the desk for more, but since you
have three fuchsia books, you'll have to put a book back into stacks before you get another.

In this analogy, the index sets are like the different colors, and the amount of books is like the set
associativity. In this example we had a maximum of three books of a certain color at one time, so
this was like a 3-way set associative cache.

We can alleviate this problem if we increase the number of books of a certain color we can have at
once. That is, if we increase the set associativity.

We can also increase the size of the cache. All other things being equal, this has the effect of
increasing the amount of index sets in the cache. In our analogy that would be the same as not
only enlarging the des, but also refining the spectrum of colors. For example, at first we may only
consider three colors: red, yellow, and blue. However, we may include other colors, and instead our
spectrum may be red, orange, brown, yellow, aquamarine, and blue. Some colors that were
considered red may now be considered orange, some colors that used to be considered yellow we
may now call brown, et cetera. We increase the number of books we may have on our desk in this
sneaky way. Looking outside of the analogy for a second, if we double the cache size, our index
field grows by one bit. Instead of an index of 10010, for example, we could have indexes of 010010
OR 110010; our "spectrum" of indexes has increased.

As you may see, a fully associative cache would not have this problem because we ignore the
nonsense of index sets entirely. We only have one "index set" (if you can call it that) and you can
store as many cache blocks in that index set as there is space in the cache.

Capacity Miss

A capacity miss occurs when a memory location is accessed once, but later because the cache fills
up, that data is discarded, and then when we get a miss when accessing that memory location
again because that data is no longer in memory.

The analogy for this is simple. Your desk fills up, so you must send back a book to make room for
new books on the desk, even though you may need the book you're sending back later.

The solution to this is to increase the cache size; that is, get a bigger desk.

Invalidation Miss

What happens if main memory whose data is also loaded in cache is modified? In that case the
cache data has been invalidated, and if the computer looks for it again it will have to be reloaded
from main memory to get the up-to-date information. You might say, "that could never happen; all
memory accesses are piped through the memory system, so cache would catch it." That's not
always the case. I/O devices often write directly to memory (especially in older machines), so it is
possible that we may have to reload data from main memory even though it may be in cache. This
is called an invalidation miss.

The analogy for this is that a new edition of a book you have on your desk comes out, and it has
new or corrected information. Regardless of how recently you brought this now out-of-date book to
your desk, you must get the new book rather than refer to dated material.

https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 13/15
19. 01. 2023. 07:00 Cache

Writing to Memory

I alluded to the potential problems of memory writing in my brief discussion of the invalidation miss,
but for the most part throughout this text I have begged the issue of what happens to cache in the
event of a memory write.

Memory writes are difficult to implement because something in memory changes when you do a
memory write, and it is difficult to keep cache and main memory consistent with whatever change
was made by the program. Memory reads, on the other hand, are no problem. If all a program
does is read memory, memory cannot change, so there is no need to make sure that cache and
memory are consistent with each other.

Before we go about making cache and memory consistent, however, the primary question is where
our data will go; will writes first go in cache, in which case we'd have to bring main memory up to
date later (called a "write back" policy), or write directly to main memory (called a "write through"
policy)? The answer is that we can choose from either of these policies.

Write Back

Write back is the policy of writing changes to cache. This is simple enough; our cache blocks stay
valid since they have the latest data. The problem with that is that if modified blocks ever were
overwritten in the cache, we'd have to write the now discarded data back to main memory or
changes will be lost.

One inefficient solution would be to simply write all discarded data back to main memory, but this is
undesirable; if every cache miss involved not only a read from main memory but in addition a write
to main memory, there would be a dramatic increase in the miss penalty. A more efficient solution is
to assign a "dirty bit" to each cache block. In the course of a program's execution if a block of data
is modified we set the dirty bit, and write the data back to main memory if the dirty bit for the block
we're discarding has been set.

The advantage of a write back policy is that instead of writing to slower memory, we're writing to
cache. Just as we're more likely to read from the same location in memory in a given chunk of
time, the same can be said for memory writes. Instead of writing to memory every time we have a
memory write, we only write back once we're done with that block of data.

One disadvantage of this is that its control is somewhat complicated compared to write through.

Write Through

Write through is the policy of writing directly to cache, but also writing to memory. The problem is
that writing data back to RAM slows things down. The solution is to have something called a write
buffer. We stick our changed data in the write buffer, and whenever we have a DRAM write cycle
the write buffer is emptied and the data is changed.

A problem arises here. A write buffer exists so a computer can continue to execute instructions
faster than the time it takes to write new data through to main memory. It is therefore conceivable
that a whole bunch of write instructions could fill the buffer faster than the buffer's capacity to get
rid of data already in the buffer.

This is not an insurmountable problem. We can install a level two cache to hold writes. This also
has the effect of speeding up the process of writing back to memory, since this L2 cache will be
faster than DRAM. By appropriately regulating the flow of data between DRAM and the L2 cache,
write overflow may be overcome.

Sub-blocks

https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 14/15
19. 01. 2023. 07:00 Cache

This revision does not cover sub-blocks. Sorry. If it's any comfort, I don't even remember them
being mentioned in lecture, and we never did anything with them. I'm writing this on vacation, so
I'm working from memory; I can't remember much about them aside from the fact that sub-blocks
reduce the miss penalty.

Cache Aware Programming


In the previous section we talked about ways of reducing misses by changing the cache size, set
associativity, etc. The problem is that we as programmers don't deal with the machine at that level.
We already have a machine, and the cache of that machine is the cache of that machine, and we
must deal with that. The only way we are to minimize these misses would be to modify our
program to take advantage of cache. If we have two different programs that do the same thing, it is
likely that one program may use cache more efficiently than the other. There are plenty of
examples of equivalent cache efficient and cache inefficient programs in the notes. Depending on
the sort of program you're working on, it may also be helpful to understand the basics of the
memory system of the particular computer your program will run on.

Another source of inefficiency not mentioned (at least in my version of the notes) arises from one
of the few weaknesses of object oriented programming. When you're working with objects, you call
a lot of class object functions. The result is that we're accessing instructions (stored in memory) all
over the place. This is cache inefficient, so if you have a C program and a C++ program that do the
same thing, the C program will tend to run a bit faster, especially if the C++ program takes
advantage of its object oriented nature.

Thomas Finley 2000

https://www.cs.cornell.edu/~tomf/notes/cps104/cache.html 15/15

You might also like