
Lecture 2: Memory Systems

 Basic components

 Memory hierarchy

 Cache memory

 Virtual Memory

Internal and External Memories

[Figure: The CPU exchanges data with the Main Memory via data transfer and control lines; the Main Memory in turn exchanges data with the Secondary Memory via data transfer and control lines.]
Main Memory Model
[Figure: Main memory modeled as an array of words (8, 16, 32, or 64 bits each), addressed from 0 upwards. A memory control unit performs read/write control and address selection; data passes through the MBR (memory buffer register, in the CPU) and the address comes from the MAR (memory address register, in the CPU).]

Memory Characteristics
The most important characteristics of a memory:
 speed — as fast as possible;
 size — as large as possible;
 cost — reasonable price.
They are determined by the technology used for implementation.
[Figure: Analogy: your personal library.]

Memory Access Bottleneck

[Figure: The path between the CPU and the Memory forms a bottleneck.]

A quantitative measurement of the capacity of the bottleneck is
the memory bandwidth.

Memory Bandwidth
 Memory bandwidth denotes the amount of data that can
be accessed from a memory per second:
1
M-Bandwidth
M Bandwidth = ∙ amount of data p
per access
l ti
memory cycle time

Ex. MCT = 100 nano second and 4 bytes (a word) per access:
M-Bandwidth = 40 mega bytes per second.
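A minimal sketch of this calculation (the Python names are illustrative, not from the lecture):

```python
def memory_bandwidth(cycle_time_s: float, bytes_per_access: int) -> float:
    """Bandwidth in bytes per second: (1 / cycle time) * data per access."""
    return bytes_per_access / cycle_time_s

# Slide example: 100 ns memory cycle time, 4 bytes (one word) per access.
bw = memory_bandwidth(cycle_time_s=100e-9, bytes_per_access=4)
print(f"{bw / 1e6:.0f} MB/s")  # -> 40 MB/s
```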

 There are two basic techniques to increase the bandwidth
of a given memory:
 Reduce the memory cycle time
• Expensive
• Memory size limitation
 Divide the memory into several banks, each of which has its own
control unit.

Memory Banks

Interleaved placement of program and data.

[Figure: Four memory banks, each with its own control unit, connected to the CPU. Consecutive words are interleaved across the banks: bank 0 holds words 0, 4, 8, 12; bank 1 holds 1, 5, 9, 13; bank 2 holds 2, 6, 10, 14; bank 3 holds 3, 7, 11, 15.]
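A minimal sketch of low-order interleaving, assuming the word address modulo the number of banks selects the bank (a common convention; the slide does not spell one out):

```python
NUM_BANKS = 4

def bank_and_offset(address: int) -> tuple[int, int]:
    """Consecutive addresses hit consecutive banks, so sequential
    program/data accesses can proceed in parallel across banks."""
    return address % NUM_BANKS, address // NUM_BANKS

for addr in range(8):
    bank, offset = bank_and_offset(addr)
    print(f"word {addr} -> bank {bank}, offset {offset}")
```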

Lecture 2: Memory Systems

 Basic components

 Memory hierarchy

 Cache memory

 Virtual Memory

Motivation
 What do we need?
 A memory to store very large programs and to work at a speed
comparable to that of the CPU.

 The reality is:
 The larger a memory, the slower it will be;
 The faster the memory, the greater the cost/bit.

 A solution:
 To build a composite memory system which combines a small
and fast memory with a large and slow memory, and behaves,
most of the time, like a large and fast memory.
 The two-level principle above can be extended into a hierarchy
of many levels.

Memory Hierarchy
CPU
Registers

Cache

Main Memory

Secondary Memory
of direct access type

Secondary Memory
of archive type

Memory Hierarchy

Access time      Level                                    Capacity (example)
1-10 ns          CPU registers                            16-256
10-50 ns         Cache                                    4-512K
40-500 ns        Main Memory                              4-256M
5-100 ms (4KB)   Secondary memory of direct access type   40G/unit
0.5-5 s (8KB)    Secondary memory of archive type         50M/tape

As one goes down the hierarchy, the following occur:
 Decreasing cost/bit.
 Increasing capacity.
 Increasing access time.
 Decreasing frequency of access by the CPU.

Lecture 2: Memory Systems

 Basic components

 Memory hierarchy

 Cache memory

 Virtual Memory

Mismatch of CPU and MM Speeds
[Figure: Memory cycle time (nanoseconds, log scale from 10^0 to 10^4) plotted against year (1955-2005) for CPUs and main memories. A speed gap of roughly one order of magnitude (i.e., 10 times) separates the two curves.]

Cache Memory
[Figure: The CPU exchanges addresses, instructions and data with both the Cache and the Main Memory; the Cache in turn exchanges addresses, instructions and data with the Main Memory.]

 A cache is a very fast memory which is put between
the main memory and the CPU, and used to hold
segments of program and data of the main memory.

Zebo’s Cache Memory Model
Personal library for a high-speed reader.

[Figure: The cache consists of storage cells and a memory controller.]

 A computer is a “predictable and iterative reader”; therefore a
high cache hit ratio, e.g., 96%, is achievable even with a
relatively small cache.

Cache Memory Features

 It is transparent to the programmers.
 Only a small part of the program/data in the main memory
has its copy in the cache (e.g., an 8KB cache with an 8MB
memory).
 If the CPU wants to access program/data not in the cache (called a
cache miss), the relevant block of the main memory will be copied
into the cache.
 Memory accesses in the immediate future will usually refer to the
same word or words in the neighborhood, and will not have to
involve the main memory.
 This property of program executions is denoted as locality of reference.

Locality of Reference
 Temporal locality: If an item is referenced, it will
tend to be referenced again soon.
 Spatial locality: If an item is referenced, items
whose addresses are close by will tend to be
referenced soon.
 This access pattern is referred to as the locality of
reference principle, which is an intrinsic feature
of the von Neumann architecture:
 Sequential instruction storage.
 Loops and iterations (e.g., subroutine calls).
 Sequential data storage (e.g., arrays).

Layered Memory Performance

Average Access Time ≈
    Phit × Tcache_access +
    (1 – Phit) × (Tmm_access + Tcache_access) × Block_size +
    Tchecking

where
Phit = the probability of a cache hit (the cache hit ratio);
Tcache_access = cache access time;
Tmm_access = main memory access time;
Block_size = number of words in a cache block; and
Tchecking = the time needed to check for cache hit or miss.

Ex. A computer has 8MB MM with 100 ns access time, 8KB cache with 10 ns
access time, Block_size = 4, Tchecking = 0 and Phit = 0.97; the AAT will be 22.9 ns.
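A minimal sketch of the formula, reproducing the slide's example (the function name is illustrative):

```python
def average_access_time(p_hit: float, t_cache: float, t_mm: float,
                        block_size: int, t_checking: float = 0.0) -> float:
    """AAT = Phit*Tcache + (1 - Phit)*(Tmm + Tcache)*Block_size + Tchecking.
    On a miss, a whole block of Block_size words is fetched from MM."""
    return (p_hit * t_cache
            + (1 - p_hit) * (t_mm + t_cache) * block_size
            + t_checking)

# Slide example: 100 ns MM, 10 ns cache, 4-word blocks, 97% hit ratio.
print(round(average_access_time(0.97, 10, 100, 4), 1))  # -> 22.9 (ns)
```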

Cache Design
 The size and nature of the copied block must be carefully
designed, as well as the algorithm to decide which
block is to be removed from the cache when it is full:
 Cache block size (line size).
 Total cache size.
 Mapping function.
 Replacement method.
 Write policy.
 Numbers of caches:
• Single, two-level, or three-level cache.
• Unified vs. split cache.

Split Data and Instruction Caches?

 Split caches (Harvard Architectures):
+ Competition for the cache between instruction processing and
execution units is eliminated.
+ Instruction fetch can proceed in parallel with memory access
from the CPU for operands.
 One may be overloaded while the other is underutilized.

 Unified caches:
+ Better balance the load between instruction and data fetches
depending on the dynamics of the program execution.
+ Design and implementation are cheaper.
 Lower performance.

Direct Mapping Cache
 Direct mapping - Each block of the main memory
is mapped into a fixed cache slot.

[Figure: Several main memory blocks map to the same fixed slot of the cache (storage cells plus memory controller).]
Direct Mapping Cache Example

 We have a 10,000-word MM and a 100-word cache. 10 memory
cells are grouped into a block.

Memory address = Tag (2 digits) | Slot (1 digit) | Word (1 digit)
e.g., address 0115: tag = 01, slot = 1, word = 5.

[Figure: The 10,000-word memory consists of blocks 0000-0009, 0010-0019, 0020-0029, ..., 0100-0109, 0110-0119, 0120-0129, ..., 9990-9999. The 100-word cache has 10 slots (slot 0 holds words 00-09, slot 1 holds 10-19, ..., slot 9 holds 90-99), each storing a tag that identifies the block currently held; e.g., blocks 0010-0019, 0110-0119, ... all map to slot 1.]
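A minimal sketch of this decimal address split (names are illustrative):

```python
def split_address(addr: int) -> tuple[int, int, int]:
    """Split a 4-digit decimal word address into (tag, slot, word)."""
    word = addr % 10          # position within the 10-word block
    slot = (addr // 10) % 10  # one of the cache's 10 slots
    tag = addr // 100         # identifies which block occupies the slot
    return tag, slot, word

print(split_address(115))   # -> (1, 1, 5): tag 01, slot 1, word 5
print(split_address(9995))  # -> (99, 9, 5): tag 99, slot 9, word 5
```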

Direct Mapping Pros & Cons
 Simple to implement and therefore inexpensive.
 Fixed location for blocks.
 If a program repeatedly accesses 2 blocks that map to the same
cache slot, the cache miss rate is very high.

[Figure: Two memory blocks compete for the same cache slot, repeatedly evicting each other.]

Associative Mapping
 A main memory block can be loaded into any slot of the cache.
 To determine if a block is in the cache, a mechanism is needed
to simultaneously examine every slot’s tag.
[Figure: Associative memory example. Each cache slot stores the full tag (e.g., 010, 287, 001, 297) of the memory block (0000-0009, 0010-0019, ..., 9990-9999) it currently holds; all tags are examined simultaneously against the requested address.]

Fully Associative Organization

[Figure: Fully associative cache organization: the tag of the requested address is compared in parallel with the tag of every cache slot.]
Set Associative Organization

 The cache is divided into a number of sets (K).
 Each set contains a number of slots (W).
 A given block maps to any slot in a given set:
 e.g., block i can be in any slot of set j.
 For example, 2 slots per set (W = 2):
 2-way associative mapping.
 A given block can be in one of 2 slots.
 Direct mapping: W = 1 (no alternative).
 Fully associative: K = 1 (W = total number of all
slots in the cache, all mappings possible).
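A minimal sketch of the mapping, assuming block i goes to set i mod K (a common choice; the slide does not fix one):

```python
K = 4  # number of sets
W = 2  # slots per set (2-way set associative)

def candidate_slots(block: int) -> list[int]:
    """A block may occupy any of the W slots of its set."""
    s = block % K  # set index for this block
    return [s * W + way for way in range(W)]

print(candidate_slots(5))  # block 5 -> set 1 -> slots [2, 3]
print(candidate_slots(9))  # block 9 -> set 1 -> the same two slots
```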

Replacement Algorithms
 With direct mapping, no replacement algorithm is needed.
 With associative mapping, a replacement algorithm is
needed in order to determine which block to replace:
 First-in-first-out (FIFO).
 Least-recently used (LRU) - replace the block that has been
in the cache longest without any reference to it.
 Least-frequently used (LFU) - replace the block that has
experienced the fewest references.
 Random.

[Figure: Each cache slot stores use info alongside its tag.]
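A minimal LRU sketch using an ordered dictionary (illustrative; real caches keep use bits in hardware, not a dictionary):

```python
from collections import OrderedDict

class LRUCache:
    """Tracks block usage order; evicts the least-recently used block."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block id -> data, oldest first

    def access(self, block: int, data=None):
        if block in self.blocks:      # hit: mark as most recently used
            self.blocks.move_to_end(block)
        else:                         # miss: evict LRU block if full
            if len(self.blocks) >= self.capacity:
                victim, _ = self.blocks.popitem(last=False)
                print(f"evict block {victim}")
            self.blocks[block] = data

cache = LRUCache(2)
for b in [1, 2, 1, 3]:  # block 2 is least recently used when 3 arrives
    cache.access(b)      # prints "evict block 2"
```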

Write Policy
 The problem:
 How to keep cache content and main memory content
consistent without losing too much performance?

 Write through:
 All write operations are passed to main memory:
if the addressed location is currently in the cache, the
cache is updated so that it is coherent with the main
memory.
 For writes, the processor always slows down to main
memory speed.
 Since the percentage of writes is small (ca. 15%), this
scheme doesn’t lead to a large performance reduction.

Write Policy (Cont’d)
 Write through with buffered write:
 The same as write-through, but instead of slowing the processor
down by writing directly to main memory, the write address and data
are stored in a high-speed write buffer; the write buffer transfers
data to main memory while the processor continues its task.
 Higher speed, but more complex hardware.

 Write back:
 Write operations update only the cache memory, which is not kept
coherent with main memory. When the slot is replaced from the
cache, its content has to be copied back to memory.
 Good performance (usually several writes are performed on a cache
block before it is replaced), but more complex hardware is needed.
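A minimal sketch contrasting the two policies with a dirty bit (illustrative Python, not a real cache controller):

```python
class CacheLine:
    def __init__(self):
        self.data = None
        self.dirty = False  # used by write-back only

def write_through(line: CacheLine, main_memory: dict, addr: int, value):
    """Every write also goes to main memory; cache stays coherent."""
    line.data = value
    main_memory[addr] = value  # processor waits at main memory speed

def write_back(line: CacheLine, value):
    """Write hits update only the cache; MM is updated at eviction."""
    line.data = value
    line.dirty = True

def evict(line: CacheLine, main_memory: dict, addr: int):
    """On replacement, a dirty line must be copied back to memory."""
    if line.dirty:
        main_memory[addr] = line.data
        line.dirty = False
```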

 Cache coherence problems are very complex and difficult to
solve in multiprocessor systems (to be discussed later)!

Cache Architecture Examples

 Intel 80486 (introduced 1989):
 a single on-chip cache of 8 Kbytes
 line size: 16 bytes
 4-way set associative organization

 Intel Pentium (introduced 1993):
 two on-chip caches, one for data and one for instructions
 each cache: 8 Kbytes
 line size: 32 bytes
 2-way set associative organization

 IBM PowerPC 620 (introduced 1995):
 two on-chip caches, one for data and one for instructions
 each cache: 32 Kbytes
 line size: 64 bytes
 8-way set associative organization

Cache Architecture Examples (Cont’d)
 Intel Itanium 2 (introduced 2002) has three levels of cache:

               L1               L2              L3
Contents       Split D and I    Unified D + I   Unified D + I
Size           16 Kbytes each   256 Kbytes      3 Mbytes
Line size      64 bytes         128 bytes       128 bytes
Associativity  4-way            8-way           12-way
Access time    1 cycle          5-7 cycles      14-17 cycles
Store policy   Write-through    Write-back      Write-back

Lecture 2: Memory Systems

 Basic components

 Memory hierarchy

 Cache memory

 Virtual Memory

Motivation for Virtual Memory
 The physical main memory (RAM) is very limited in space.
 It may not be big enough to store all the executing
programs at the same time.
 Some programs may need more memory than the main
memory size, but not all of a program needs to be
maintained in the main memory at the same time.
 Virtual memory takes advantage of the fact that at any
given instant of time, an executing program needs only a
fraction of the memory that the whole program occupies.
 The basic idea: Load only the pieces of each executing
program which are currently needed.

Paging
 Divide programs (processes) into equal-sized, small
blocks, called pages.
 Divide the primary memory into equal-sized, small
blocks, called page frames.
 Allocate the required number of page frames to a
program.
 A program does not require contiguous page
frames!
 The operating system (OS) is responsible for:
 Maintaining a list of free frames.
 Using a page table to keep track of the mapping
between pages and page frames.

Logical and Physical Addresses

[Figure: A logical address (page number + offset) is translated via the page table into a physical address (page frame + offset); the example shows pages 0-3 mapped to page frames.]

Implementation of the page tables:
 Main memory — slow, since an extra memory
access is needed.
 Separate registers — fast but expensive.
 Cache.
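A minimal sketch of the translation, assuming a simple in-memory page table (page size and table contents are illustrative):

```python
PAGE_SIZE = 1024

# Page table: page number -> page frame (None = page not in main memory).
page_table = [5, 2, None, 7]

def translate(logical_addr: int) -> int:
    """Physical address = page frame * page size + offset."""
    page, offset = divmod(logical_addr, PAGE_SIZE)
    frame = page_table[page]
    if frame is None:
        raise RuntimeError("page fault: the OS must load the page")
    return frame * PAGE_SIZE + offset

print(translate(1 * PAGE_SIZE + 100))  # page 1 -> frame 2 -> 2148
```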

Virtual Memory
 Demand paging
 Do not require all pages of a process in memory.
 Bring in pages as required.
 Page fault
 Required page is not in memory.
 Operating system must swap in the required page.
 May need to swap out a page to make space.
 Select page to throw out based on recent history.

Objective of Virtual Memory
 To provide the user/programmer with a much bigger
memory than the main memory, with the help of the
operating system.
 Virtual memory >> real memory

[Figure: Program addresses 0000-5000 are mapped partly onto MM addresses 0000-3000 and partly onto secondary memory.]

Page Fault
 When accessing a VM page which is not in the main memory,
a page fault occurs. The page must then be loaded from the
secondary memory into the main memory by the OS.
[Figure: A virtual address consists of a page number and an offset. The page number is looked up in the page map; if the page is among the pages in MM the access proceeds, otherwise a page fault interrupt is raised to the OS.]

Page Replacement
 When a page fault occurs and all page frames are
occupied, one of them must be replaced.
 If the replaced page has been modified during the time it
resides in the main memory, the updated version should
be written back to the secondary memory.
 Ideally, we would replace the page which will not be
accessed for the longest time in the future.
 Problem — We don’t know exactly what will happen in
the future.
 Solution — We predict the future by studying the access
patterns up till now (“learn from history”).

Replacement Algorithms
 FIFO (First In First Out) — replace the page that has been
in MM for the longest time.
 LRU (Least Recently Used) — replace the page that has
not been accessed for the longest time.
 LFU (Least Frequently Used) — replace the page that has
the smallest number of accesses during the latest time period.

 Replacement at random (used for caches) is not used for VM!
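A minimal FIFO sketch (illustrative; a real OS keeps this state in its frame tables):

```python
from collections import deque

def fifo_replacement(references, num_frames: int) -> int:
    """Count page faults, replacing the page longest in memory."""
    frames = deque()  # oldest page at the left end
    faults = 0
    for page in references:
        if page not in frames:
            faults += 1                # page fault: load the page
            if len(frames) == num_frames:
                frames.popleft()       # evict the oldest page
            frames.append(page)
    return faults

print(fifo_replacement([1, 2, 3, 1, 4, 1], num_frames=3))  # -> 5
```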

Summary
 A memory system has to fit very large programs and still
provide fast access.
 No single type of memory can provide all the needs of a
computer system.
 Usually, several different storage mechanisms are
organized in a layered hierarchy:
 Cache is a hardware solution to improve memory access, which
is transparent to the programmers.
 Virtual memory provides a much larger address space than the
available physical space, with the help of the OS (a software
solution).
 The layered structure works very well, partly due to the
locality of reference principle.
