Computer Architecture Slides
Acknowledgment
Adapted from Prof. Zheng & Prof. Katkoori’s slides
1
Ubiquitous Usage of Computers
General-purpose computing: ~2%; embedded computing: ~98%
Portable applications: mobile computing, wireless, multimedia, etc.
PERFORMANCE
4
What is Computer Architecture?
5
Abstractions in Modern Computing Systems
6
How Do the Pieces Fit Together?
Algorithms, application programming
Operating system
Compilers, firmware
Instruction Set Architecture (ISA) ← this course
Memory system, instruction set processor, I/O system
Datapath & control (CSD)
Digital design / logic design
Circuit design (CMOS VLSI, low power)
8
A Processor Example
DRAM BANKS
➺ The objective of this course is to study most of
these components
9
Another CPU – Intel Core i7 (2008)
10
A System-Level View
[Figure: processor and graphics connected to memory; a processor interface links to the I/O bus and disk/USB interfaces]
11
System Design Parameters
➺ Performance (Speed)
➺ Cost
➺ Power (static + dynamic)
➺ Peak power
➺ Average power
➺ Robustness
➺ Noise-tolerance
➺ Radiation-hardness
➺ Testability
➺ Reconfigurability
➺ Time-to-market etc.
12
Classes of Computers
RISC
1.1 Introduction
Computer Technology Driving Forces
➺ Improvements in semiconductor technology
➺ Feature size, clock speed, cost
➺ Improvements in computer architectures
➺ Enabled by high-level language compilers, UNIX
➺ Led to RISC architectures
19
Technology Trends
➺ Integrated circuit technology
➺ Transistor density: +35%/year
➺ Die size: +10–20%/year
➺ Integration overall: +40–55%/year
➺ DRAM capacity: +25–40%/year (slowing)
➺ Foundation of main memory
Important to design for the next generation of technology!
25
Performance Measures
➺ Bandwidth or throughput
➺ Total work done in a given time
➺ Important for servers and data center operators
➺ Latency
➺ 30–80X improvement for processors
➺ 6–8X improvement for memory and disks
➺ Improvement in bandwidth ≈ square of the improvement in latency
[Figure: log-log plot of bandwidth and latency milestones]
1.4 Trends in Technology
Power and Energy
➺ Problem: Get power in, distribute it, get it out
➺ Dynamic power ≈ ½ × capacitive load × voltage² × frequency switched
34
Factors Affecting Cost
➺ Cost: design, manufacturing, testing, material, etc.
➺ Design – one time cost
➺ Manufacturing, testing, ... – recurring cost
[Figure: wafer and individual dies]
37
Cost of Integrated Circuits
➺ Cost of Integrated circuit
➺ Bose-Einstein formula:
39
Principles of Computer Design
40
Measuring Performance
➺ Typical performance metrics:
➺ Response (execution) time & throughput
➺ Speedup of X relative to Y
➺ Execution time(Y) / Execution time(X)
➺ Execution time
➺ Wall clock time: includes all system overheads
➺ CPU time: only computation time
➺ Benchmarks
➺ Kernels (e.g., matrix multiply)
➺ Toy programs (e.g., sorting)
➺ Synthetic benchmarks (e.g., Dhrystone)
➤ These three can be misleading
➺ Benchmark suites (e.g., SPEC06fp, TPC-C)
48
Where is the Market?
[Bar chart: millions of computers sold per year, 1998–2002, for embedded, desktop, and server markets. Embedded grows from ~290 to ~1122 million; desktop stays near 93–135 million; servers grow from 3 to 5 million]
It is estimated that there will be between 15 and 20 billion small embedded devices, a $900 billion USD market that is growing twice as fast as the PC market.
➺ Module reliability
➺ Mean time to failure (MTTF)
➺ Mean time to repair (MTTR)
➺ Mean time between failures (MTBF) = MTTF + MTTR
➺ Availability = MTTF / MTBF
➺ i.e., the ratio of service time to total lifetime
1.7 Dependability 51
Instruction Set Principles
1
Objective and Reading
➺ Objective
➺ understand issues and tradeoffs in instruction set design
➺ Reading
➺ Computer Architecture: A Quantitative Approach
➤ Appendix A
➺ recommended: Computer Organization and Design: The
Hardware/Software Interface
➤ Chapter 2
2
Abstractions
➺ Instruction set architecture (ISA)
➺ The hardware/software interface
➺ Hides lower-level detail
➺ Compiler
➺ Translates a high-level program to assembly
➺ Assembler
➺ Translates assembly to machine code
[Figure: abstraction layers for dealing with complexity: Problem, Runtime System (VM, OS, MM), ISA, Implementation]
A.1 Introduction 4
Program Execution Model
➺ A computer is just an FSM
➺ State stored in registers, memory, PC, etc.
➺ State changed by instruction execution
➺ An instruction is executed in steps:
➺ Fetch the instruction into the CPU from memory
➺ Decode it to generate control signals
➺ Execute it (add, mult, etc.)
➺ Write back the output to a register or memory
➺ Programs and data coexist in memory
➺ How to distinguish program from data?
[Figure: uP connected to Mem; stages Fetch, Decode, Execute, Writeback]
A.1 Introduction 5
What Makes a Good ISA?
➺ Programmability
➺ Who does assembly programming these days?
➺ Performance/Implementability
➺ Easy to design high-performance implementations?
➺ Easy to design low-power/energy implementations?
➺ Easy to design low-cost implementations?
➺ Compatibility
➺ Easy to maintain as languages, programs evolve
➺ x86 (IA32) generations: 8086, 286, 386, 486, Pentium,
Pentium-II, Pentium-III, Pentium4, Core2, Core i7, ...
A.1 Introduction 6
Performance
➺ Execution time = IC * CPI * cycle time
➺ IC: instructions executed to finish a program
➺ Determined by program, compiler
➺ CPI: number of cycles needed for each
instruction
➺ Determined by compiler, micro-architecture
➺ Cycle time: inverse of clock frequency
➺ Determined by micro-arch. & technology
➺ Ideally optimize all three
➺ Their optimizations often work against each other
➺ Compiler plays a significant role in reducing IC
A.1 Introduction 7
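As a quick illustration (mine, with made-up numbers), the relationship on this slide can be written directly in C:

#include <stdio.h>

/* Hypothetical numbers, only to show Execution time = IC * CPI * cycle time. */
int main(void) {
    double ic = 1.0e9;          /* instructions executed          */
    double cpi = 1.2;           /* average cycles per instruction */
    double cycle_time = 0.5e-9; /* seconds per cycle (2 GHz)      */
    printf("Execution time = %.3f s\n", ic * cpi * cycle_time);
    return 0;
}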
Instruction Granularity
Half word (2 bytes): address mod 2 = 0
Word (4 bytes): address mod 4 = 0
Double word (8 bytes): address mod 8 = 0
Aligned if Addr mod size = 0
A.3 Memory Addressing 14
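A small C sketch (my own, not from the slides) of the alignment rule above:

#include <stdio.h>
#include <stdint.h>

/* An access of 'size' bytes at address 'addr' is aligned iff addr mod size == 0. */
static int is_aligned(uint64_t addr, uint64_t size) {
    return addr % size == 0;
}

int main(void) {
    printf("%d %d %d\n",
           is_aligned(0x1000, 8),   /* 1: double-word aligned */
           is_aligned(0x1002, 4),   /* 0: misaligned word     */
           is_aligned(0x1002, 2));  /* 1: half-word aligned   */
    return 0;
}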
Memory Addressing
➺ Alignment
➺ Aligned accesses only, or no alignment restrictions
[Figure: aligned and misaligned half-word, word, and double-word accesses]
➺ Immediate addressing example: add R4, #3
[Figure: addressing-mode usage, SPEC CPU2000 on Alpha]
26
Instruction Encoding
➺ Affects code size and implementation efficiency
➺ Variable length (1–17 bytes)
➺ More compact, efficient use of memory
➤ Fewer memory references
➤ Advantage possibly mitigated by use of caches
➺ More complex, harder to decode
➺ Complex pipeline: instructions vary greatly in both size and amount of work to be performed
32
Compilers and Optimization
34
Example Optimizations
➺ Eliminate unreachable code
➺ Easy to find in the control-flow graph: no incoming edges
➺ No performance impact, but reduces code size
➺ Find and eliminate common subexpressions
➺ E.g., a mathematical expression that is computed multiple times
➺ Automatically remove subsequent computations if none of the operands have changed
35
Example Optimizations (cont.)
➺ Simplify algebraic expressions
➺ X = 3*4 + y → X = 12 + y ["constant folding"]
➺ X = Y; Z = X + 5 → Z = Y + 5 ["copy propagation"]
➺ X = Y * 8 → X = Y << 3 ["strength reduction"]
➺ Optimize loops
➺ Move loop-invariant instructions outside of the loop
➺ Merge loops whenever possible
➺ Unroll loops if it is cheaper than tracking an iterator
36
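To make two of these concrete, here is a small C sketch of my own (function and variable names are hypothetical) showing loop-invariant code motion and strength reduction; an optimizing compiler typically performs both automatically:

#include <stddef.h>

/* Before: n * scale is loop-invariant, and i * 8 can be strength-reduced. */
void scale_before(long *dst, const long *src, size_t n, long scale) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * ((long)n * scale) + (long)i * 8;
}

/* After: hoist the invariant and replace the multiply with a running sum. */
void scale_after(long *dst, const long *src, size_t n, long scale) {
    long inv = (long)n * scale;   /* hoisted loop invariant */
    long i8  = 0;                 /* strength-reduced i * 8 */
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * inv + i8;
        i8 += 8;
    }
}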
Effect of Compiler Optimizations
Compiled with gcc for Pentium 4 under Linux
38
RISC Philosophy
41
Example Toolchain Output
riscv64-unknown-elf-objdump -d test.o
[Figure: disassembly listing of test.o]
42
What is RISC-V?
➺ Base ISAs
➺ RV32I: base 32-bit integer instructions (32b registers)
➺ RV32E: embedded variant of RV32I with only 16 registers
➺ RV64I: base 64-bit integer instructions (64b registers)
➺ Extensions
➺ M: Integer Multiplication and Division
➺ A: Atomic Instructions
➺ F: Single-Precision Floating-Point
➺ D: Double-Precision Floating-Point
➺ C: Compressed Instructions (16 bit)
45
RV32 Registers
➺ 32 integer registers: x0 – x31 (x0 is hardwired to zero)
➺ 32 FP registers: f0 – f31
➺ optional (F/D extensions)
47
RV Data Types
➺ 8-bit bytes
➺ 16-bit half-word: common in C
➺ 32-bit words, integer, single-precision FP
➺ 64-bit double words: integer, double-precision FP
➺ Immediate: 12 bits
➺ Displacement: 12 bits
➺ Register indirect
➺ simulated by placing 0 for displacement
See App A and the RISC-V ISA Manual for more information
55
Types of Instructions
1
Objective and Reading
➜ Objective
➜Review of basic pipelining
architecture – how it
improves performance, and its issues
➜Introduce ILP
➜ Reading
➜Appendix C.1
➜Chapter 3.1
2
A Simplified View of Computers
[Figure: CPU with instruction cache (I$) and data cache (D$), main memory, and an interconnect to hard disk, display, keyboard, etc.]
3
C.1 Introduction
4
Introduction
➜ Design Principle – exploit parallelism
➜ Pipelining became a universal technique by 1985
➜ Overlaps execution of instructions
➜ Exploits instruction-level parallelism (ILP)
➜ Two main approaches to exploit ILP
➜ Hardware-based dynamic approaches
➜ Used in server and desktop processors
➜ Not used as extensively in PMD processors
➜ Compiler-based static approaches
➜ Not as successful outside of scientific applications
[Figure: pipeline stages Fetch, Decode, Execute, Writeback]
C.1 Introduction 5
Instruction Execution of RISC
➜ Initial State: PC is set to point to the first instruction
C.1 Introduction 6
Instruction Execution
➜ Instruction Fetch:
➜ Send PC to memory, assert MemRead signal
➜ Instruction fetched from memory
➜ Place instruction in IR: IR ← Mem[PC]
➜ Update PC to next instruction: PC ← [PC] + 4
C.1 Introduction 7
Instruction Execution
➜ Instruction Decode:
➜ Instruction in IR decoded by control logic, instruction type and
operands determined
➜ Source operands read from general purpose register file, etc
C.1 Introduction 8
Instruction Execution
➜ Execute:
➜ ALU operates on operands prepared in previous cycle
➜ One of four functions depending upon opcode
➜ Memory Reference
➜ Form effective address from base register and immediate offset
➜ ALU Output ← [A] + Imm
➜ Register-Register ALU Instruction
➜ ALU Output ← [A] op [B]
➜ Register-Immediate ALU Instruction
➜ ALU Output ← [A] op Imm
➜ Branch
➜ Compute branch target by adding Imm to PC
➜ ALU Output ← [PC] + (Imm << 2)
➜ Evaluate the branch condition
C.1 Introduction 9
Instruction Execution
➜ Memory Access:
➜ For load instructions, read data from memory
➜ For store instructions, write data to memory
C.1 Introduction 10
Instruction Execution
➜ Write-back:
➜ Results written to destination register
➜ Results from mem read or ALU
C.1 Introduction 11
Instruction Execution – Example
add X3, X4, X5 ; X3 ← [X4] + [X5]
➜ Source registers: X4, X5 Destination register: X3
➜ Instruction steps:
➜ Fetch: Fetch the instruction into IR and increment the program
counter
➜ Decode: Decode the instruction in IR to determine the
operation to be performed (add). Read the contents of
registers X4 and X5
➜ Execute: Compute the sum [X4] + [X5]
➜ Memory Access: No action, since there are no memory
operands
➜ Write-back: Write the result into register X3
C.1 Introduction 12
Instruction Execution – Example
ld X5, N(X7) ; X5 ← Mem[[X7] + N]
➜ Source register: X7 Destination register: X5
➜ Immediate value N is given in the instruction word
➜ Instruction steps:
Fetch: Fetch the instruction and increment the program counter
Decode: Decode the instruction in IR to determine the operation
to be performed (load). Read the contents of register X7
Execute: Add the immediate value N to the contents of X7
Memory Access: Use the sum N+[X7] as the effective address of
the source operand, read the contents of that location from
memory
Write-back: Write the data received from memory into register X5
C.1 Introduction 13
Instruction Execution – Example
sd X6, N(X8) ; Mem[N + [X8]] ← [X6]
➜ Source registers: X6, X8 Destination register: None
➜ The immediate value N is given in the instruction word
➜ Instruction steps:
➜ Fetch: Fetch the instruction and increment the program
counter
➜ Decode: Decode the instruction in IR to determine the
operation to be performed (store). Read the contents of
registers X6 and X8.
➜ Execute: Compute the effective address N + [X8]
➜ Memory Access: Store the contents of register X6 into memory
location N + [X8]
➜ Writeback: No action
C.1 Introduction 14
Basic Pipeline
C.1 Introduction 15
• Pipeline clock cycle is determined by the slowest stage.
• Pipeline registers add extra overhead.
C.1 Introduction 16
Ideal Pipeline Performance
➜ Balanced pipeline (each stage has the same delay)
➜ Ignore overhead due to clock skew and pipeline registers
➜ Ignore pipeline fill and drain overheads
➜ Under these assumptions, speedup = number of pipeline stages
C.1 Introduction 17
Pipeline Performance
➜ Example: A program consisting of 500 instructions is executed on
a 5-stage processor. How many cycles would be required to
complete the program? Assume ideal overlap in case of pipelining.
➜ Without pipelining:
➜ Each instruction will require 5 cycles. There will be no overlap
amongst successive instructions.
➜ Number of cycles = 500 * 5 = 2500
➜ With pipelining:
➜ Each pipeline stage will process a different instruction every
cycle. First instruction will complete in 5 cycles, then one
instruction will complete in every cycle, due to ideal overlap.
➜ Number of cycles = 1*5 + (499*1) = 504
➜ Speedup with pipelining = 2500/504 = 4.96
C.1 Introduction 18
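A small C sketch of my own that reproduces the arithmetic above for an arbitrary instruction count and pipeline depth:

#include <stdio.h>

int main(void) {
    long n = 500;        /* instructions   */
    long stages = 5;     /* pipeline depth */
    long no_pipe = n * stages;        /* 500 * 5 = 2500 cycles */
    long pipe    = stages + (n - 1);  /* 5 + 499 = 504 cycles  */
    printf("speedup = %.2f\n", (double)no_pipe / pipe);   /* ~4.96 */
    return 0;
}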
Pipeline Performance
➜ Problem: Consider a non-pipelined processor using the 5-
stage datapath with 1 ns clock cycle. Assume that due to
clock skew and pipeline registers, pipelining the processor
adds 0.2 ns of overhead to the clock cycle. How much
speedup can we expect to gain from pipelining? Assume a
balanced pipeline and ignore the pipeline fill and drain
overheads. (A similar ex. in the book)
➜ Solution:
➜ Without pipelining: clock period = 1 ns, CPI = 5 → 5 ns per instruction
➜ With pipelining: clock period = 1 + 0.2 = 1.2 ns, CPI = 1 → 1.2 ns per instruction
➜ Speedup = 5 / 1.2 ≈ 4.17
C.1 Introduction 19
Pipeline Performance
➜ The potential increase in performance resulting from
pipelining is proportional to the number of pipeline
stages
➜ However, this increase would be achieved only if
➜ all pipeline stages require the same time to complete, and
➜ there is no interruption throughout program execution
C.1 Introduction 20
Pipeline Performance – cont’d
21
Pipeline Stalls
[Figure: pipeline diagram for instructions I_j, I_j+1, I_j+2]
➜ Assume that Instruction Ij+1 is stalled in the decode stage for two extra
cycles
➜ This will cause Ij+2 to be stalled in the fetch stage, until Ij+1 proceeds
➜ New instructions cannot enter the pipeline until Ij+2 proceeds past the
fetch stage after cycle 5 => execution time increases by two cycles
C.1 Introduction 22
Summary
➜ Instruction executes in a sequence of stages
➜ Multiple instructions can execute at the same time in
different stages
➜ Instruction-level parallelism (ILP)
23
Memory Hierarchy Design
1
Recall Pipeline Performance
➺ Pipeline stages need to be balanced
➺ the clock is fixed by the slowest stage.
➺ Ideal case, every stage has same latency
➺ In reality, stages are unbalanced
➺ memory accesses are much slower
➺ Example: MEM 10 ns latency, other stages 2 ns
latency
➺ clock cycle time = 10 ns
➺ How do we reduce the MEM stage latency?
➺ Memory hierarchy
A Simplified View of Computers
[Figure: CPU with I$ and D$, main memory, and an interconnect to hard disk, display, keyboard, etc.]
Objectives and Reading
➺ Objectives
➺ Understand memory hierarchy organizations and their
impacts on performance
➺ Evaluate performance tradeoffs of different memory
hierarchy organizations
➺ Reading
➺ Computer Architecture: A Quantitative Approach
➤ Appendix B, Chapter 2
➺ Computer Organization and Design: The
Hardware/Software Interface
➤ Chapter 5
4
Memory Technology – Overview
➺ Static RAM (SRAM)
➺ 0.5ns – 2ns, $2000 – $5000 per GB
➺ Dynamic RAM (DRAM)
➺ 20ns – 30ns, $10 – $50 per GB
➺ Flash – non-volatile
➺ 20 – 100 us, 5-10x cheaper than DRAM
➺ Magnetic disk
➺ 5ms – 20ms, $0.20 – $2 per GB
➺ Ideal memory
➺ Access time of SRAM,
➺ Capacity and cost/GB of disk
5
The “Memory Wall” Problem
7
The “Memory Wall” – A Multi-Core Case
➺ Aggregate peak bandwidth grows with # cores:
➺ Intel Core i7 can generate two references per core per clock
➺ Four cores and 3.2 GHz clock
25.6 billion 64-bit data references/second +
12.8 billion 128-bit instruction references/second
= 409.6 GB/sec
➺ DRAM bandwidth is only 6% of this (25 GB/s)
➺ How does memory meet processor bandwidth demand?
• Multi-port, pipelined caches
• Two levels of cache per core
• Shared third-level cache on chip
2.1 Introduction 8
Principle of Locality – Review
➺ Temporal locality
➺ Programs often access a small proportion of their
address space at any time
➺ Items accessed recently are likely to be accessed again
soon
➺ e.g., instructions and variables in a loop
➺ Spatial locality
➺ Items near those accessed recently are likely to be
accessed soon
➺ E.g., sequential instruction access, array data
B.1 Introduction 9
Principle of Locality – Review
➺ Identify Temporal and spatial locality
int sum = 0;
int x[1000];
for (int i = 0; i < 1000; i++)
    sum += x[i];   /* sum, i: temporal locality; x[i]: spatial locality */
B.1 Introduction 10
Memory Hierarchy – Basic Idea
➺ Ideally memory = unlimited capacity with low latency
➺ Fast memory technology is more expensive per bit than
slower memory
➺ Solution: organize memory system into a hierarchy
➺ Entire addressable memory space available in largest, slowest
memory
➺ Incrementally smaller and faster memories, each containing a
subset of the memory below it, proceed in steps up toward the
processor
➺ Temporal and spatial locality ensure that nearly all references can be found in smaller memories
➺ Gives the illusion of a large, fast memory being presented to
the processor
2.1 Introduction 11
Memory Hierarchies
Mobile devices
Desktop
Servers
2.1 Introduction 12
Energy Consumptions
Song Han, FPGA’17 talk, “Deep Learning – Tutorial and Recent Trends”
[Figure: CPU ↔ Cache transfers words; Cache ↔ Main Memory transfers blocks]
B.1 Introduction 3
Direct Mapped Cache
➺ Only one choice
➺ cache index = (Block address) MOD (#Blocks in cache)
◼ #Blocks is a power of 2
◼ Use low-order address bits to access bytes within a block
B.1 Introduction 5
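A short C sketch of my own showing how a direct-mapped cache splits an address (the block size and block count here are example values, not from the slide):

#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 16u    /* example: 16-byte blocks */
#define NUM_BLOCKS  64u    /* example: 64 blocks      */

int main(void) {
    uint32_t addr = 0x12345678;
    uint32_t block_addr = addr / BLOCK_BYTES;
    uint32_t offset     = addr % BLOCK_BYTES;
    uint32_t index      = block_addr % NUM_BLOCKS;   /* (block address) MOD (#blocks) */
    uint32_t tag        = block_addr / NUM_BLOCKS;
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}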
Tags and Valid Bits
➺ One cache line may hold any one of the multiple memory blocks that map to it
➺ Cache line is aka cache block
B.1 Introduction 6
Address Subdivision
◼ Example: 1024-word cache (32-bit words → 4 KiB of data)
◼ Cache has 2^10 words, so 10 bits are used for the index
7
Address Subdivision
8
Another Cache Example
◼ Direct mapped cache (assume 32-bit address space)
◼ 2048 blocks, each holds 32 bytes of data
◼ How many bytes can we store in the cache?
◼ 2048 x 32 bytes = 65,536 bytes = 64 KiB of data
◼ Address division:
◼ 2^11 blocks → 11-bit index
◼ 2^3 words/block → 3-bit word offset within the block
◼ 2^2 bytes/word → ignore the 2 LSBs of the address (used after loading the desired word from the block)
◼ 32 – 11 – 3 – 2 = 16-bit tag
15
Cache Sizing Example
◼ How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks, assuming 32-bit addresses? What is the storage overhead?
◼ How many words? 16 KiB / 4 bytes per word → 4096 (2^12) words
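A C sketch of my own that carries the sizing through to the end; the 10-bit index, 18-bit tag, and single valid bit follow from the definitions above, but treat the exact overhead figure as my calculation rather than the slide's:

#include <stdio.h>

int main(void) {
    const long num_blocks  = (16 * 1024) / (4 * 4);   /* 16 KiB data / 16-byte blocks = 1024 */
    const int  index_bits  = 10;                       /* log2(1024)                         */
    const int  offset_bits = 2 + 2;                    /* word-in-block + byte-in-word       */
    const int  tag_bits    = 32 - index_bits - offset_bits;   /* 18                          */
    const int  data_bits   = 4 * 32;                   /* 128 data bits per block            */

    long total = num_blocks * (1 /* valid */ + tag_bits + data_bits);   /* 1024 * 147 */
    long data  = num_blocks * data_bits;
    printf("total = %ld bits, overhead = %.1f%%\n",
           total, 100.0 * (total - data) / data);      /* 150528 bits, ~14.8% overhead */
    return 0;
}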
[Figure: direct-mapped cache with 64 entries; each entry holds a valid bit, tag, and data block, indexed by bits of the 32-bit address]
B.1 Introduction 31
Example: Larger Block Size
➺ Cache: 64 blocks, 16 bytes/block
➺ To what block number does address 1200 map?
➺ Block address = 1200/16 = 75
➺ Block offset value = 1200 modulo 16 = 0
➺ Cache block index = 75 modulo 64 = 11
Address fields: Tag = bits [31:10] (22 bits), Index = bits [9:4] (6 bits), Offset = bits [3:0] (4 bits)
B.1 Introduction 32
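The same mapping written as a tiny C check of my own (block and cache sizes taken from the example above):

#include <stdio.h>

int main(void) {
    unsigned addr = 1200;
    unsigned block_addr = addr / 16;         /* 75 */
    unsigned offset     = addr % 16;         /* 0  */
    unsigned index      = block_addr % 64;   /* 11 */
    unsigned tag        = block_addr / 64;   /* 1  */
    printf("block=%u offset=%u index=%u tag=%u\n", block_addr, offset, index, tag);
    return 0;
}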
Cache Example
➺ 8-blocks, 1 word/block, direct mapped
➺ Initial state
B.1 Introduction 34
Cache Example
Word addr   Binary addr   Hit/miss   Cache block
26          11 010        Miss       010
B.1 Introduction 35
Cache Example
Word addr   Binary addr   Hit/miss   Cache block
16          10 000        Miss       000
3           00 011        Miss       011
16          10 000        Hit        000
B.1 Introduction 36
Cache Example
Word addr   Binary addr   Hit/miss   Cache block
18          10 010        Miss       010
B.1 Introduction 37
Cache Misses
B.1 Introduction 38
Cache Misses
➺ Miss rate
➺ Fraction of cache accesses that result in a miss
➺ Causes of misses
➺ Compulsory
➤ First reference to a block
➺ Capacity
➤ Blocks discarded and later retrieved due to cache capacity limit
➺ Conflict
➤ Program makes repeated references to multiple addresses from different
blocks that map to the same location in the cache
➤ Only happen in direct-mapped or set associative caches
2.1 Introduction 39
Measuring Cache Performance
➺ CPU time = (CPU cycles + mem stall cycles) * cycle time
➺ CPU cycles = IC*Ideal CPI (CPU cycles under cache hits)
➺ Memory stall cycles = additional cycles spent handling cache misses
➺ = number of misses × miss penalty (the latency of accessing lower-level memory)
➺ AMAT = hit time + miss rate × miss penalty
➺ Example
➺ CPU with 1 ns clock cycle time, hit time = 2 cycles, miss penalty = 20 cycles, I-cache miss rate = 5%
➺ AMAT = 2 + 0.05 × 20 = 3 cycles
B.1 Introduction 44
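A one-line C check of my own for the AMAT arithmetic above:

#include <stdio.h>

int main(void) {
    double hit_time = 2.0, miss_rate = 0.05, miss_penalty = 20.0;
    printf("AMAT = %.2f cycles\n", hit_time + miss_rate * miss_penalty);  /* 3.00 */
    return 0;
}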
Associative Cache Example
Fully Associative Direct Mapped 2-way Associative
B.1 Introduction
4-way Associative Cache Organization
3
Associativity Example
➺ Compare 4-block caches, block size = 1 byte
➺ Direct mapped, 2-way set associative, fully associative
➺ Sequence of addresses: 0, 8, 0, 6, 8
➺ Direct mapped
B.1 Introduction 48
Associativity Example
➺ Fully associative
B.1 Introduction 49
Spectrum of Associativity
➺ For a cache with
8 entries
B.1 Introduction 50
Size of Tags vs Associativity
➺ No. of address bits = 32
➺ Block size = 4 bytes
➺ Cache size = 16 KB → 2^12 blocks
➺ Direct mapped
➺ Tag = bits [31:14] (18 bits), Index = bits [13:2], Offset = bits [1:0]
➺ Total tag bits = 18 × 2^12
➺ Comparators = 1
➺ 4-way set associative
➺ Tag = bits [31:12] (20 bits), Index = bits [11:2], Offset = bits [1:0]
➺ Total tag bits = 4 × 2^10 × 20 = 20 × 2^12
➺ Comparators = 4
➺ Fully associative
➺ Tag = bits [31:2] (30 bits), Offset = bits [1:0]
➺ Total tag bits = 30 × 2^12
➺ Comparators = 2^12
B.1 Introduction 51
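A C sketch of my own that reproduces the tag-storage arithmetic for the three organizations above:

#include <stdio.h>

int main(void) {
    const int addr_bits = 32, block_bytes = 4, cache_bytes = 16 * 1024;
    const int num_blocks = cache_bytes / block_bytes;        /* 4096 */
    const int offset_bits = 2;
    const int ways[3] = {1, 4, 4096};                        /* DM, 4-way, fully associative */
    for (int i = 0; i < 3; i++) {
        int sets = num_blocks / ways[i];
        int index_bits = 0;
        while ((1 << index_bits) < sets) index_bits++;       /* log2(sets) */
        int tag_bits = addr_bits - index_bits - offset_bits;
        printf("%4d-way: tag = %2d bits, total tag bits = %d, comparators = %d\n",
               ways[i], tag_bits, tag_bits * num_blocks, ways[i]);
    }
    return 0;
}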
Size of Tags vs Associativity
➺ Increasing associativity requires
➺ More tag bits per cache block
➺ More comparators, each of which is more complex
➤ higher hit time, larger circuits
➺ The choice among direct, set-associative and
fully-associative mapping in any memory
hierarchy will depend on
➺ Miss rate vs cost of implementing associativity, both in
time and in extra hardware overhead
➺ Replacement policy: random
➺ Gives approximately the same performance as LRU for high associativity
high associativity
B.1 Introduction 53
Write Policy – Write-Through
➺ Update cache and memory together
➺ Cache and memory data remain the same
➺ Easier to implement
➺ Writes take longer: must wait for the memory update to complete
➺ e.g., if base CPI = 1, 10% of instructions are stores, and memory write latency is 100 cycles
➤ Effective CPI = 1 + 0.1 × 100 = 11
[Figure: CPU → Cache → Memory; writes go to both]
B.1 Introduction 54
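The same store-stall arithmetic as a tiny C sketch of my own (write-through with no write buffer):

#include <stdio.h>

int main(void) {
    double base_cpi = 1.0, store_frac = 0.10, write_latency = 100.0;
    printf("effective CPI = %.1f\n", base_cpi + store_frac * write_latency);   /* 11.0 */
    return 0;
}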
Write-Through with Write Buffer
B.1 Introduction 55
Write Policy – Write-Back
➺ Just update the block in the cache
➺ Keep track of whether each block is dirty (dirty bits)
➺ Data in cache and memory may be inconsistent
➺ When a dirty block is replaced, write it back to memory
➺ Write speed is faster
➺ One lower-level memory update covers multiple cache writes
➺ Energy saving
➺ A write buffer can also be used
[Figure: CPU writes into the Cache; Memory is updated only when a dirty block is replaced]
B.1 Introduction 56
Write Allocation
➺ What should happen on a write miss?
➺ Write allocate
➺ No-write allocate
B.1 Introduction 57
Write Miss Policies – Example
write M[100]
write M[100]
read M[200]
write M[200]
write M[100]
B.1 Introduction 58
Handling Writes
➺ Write-through + write allocate: allocate the block on a write miss; write to cache and memory in parallel
➺ Write-through + no-write allocate: write to memory on a miss; no cache block is fetched
59
Cache Performance – Example
Base CPI = 1,
cycle time = 1ns
Hit time(DM) = 1 cycle
Miss rate (DM) = 0.021
Hit time (2-way) = 1.35 cycles
Miss rate (2-way) = 0.019
Miss penalty = 150 cycles
Avg mem. req/inst = 1.4
B.1 Introduction 60
Cache Performance – Example
Which cache is faster using CPU time?
To decide, compute the total CPU execution time:
  CPU time = IC × CPI_effective × cycle time
where
  CPI_effective = base CPI + memory stall cycles per instruction
  Memory stall cycles per instruction = avg. memory requests per instruction × (AMAT – hit time)
Direct-mapped cache:
  AMAT(DM) = 1 + 0.021 × 150 = 4.15 cycles
  Memory stalls per instruction = 1.4 × (4.15 – 1) = 1.4 × 3.15 = 4.41 cycles
  CPI_effective(DM) = 1 + 4.41 = 5.41 cycles/instruction
The 2-way set-associative cache is evaluated the same way, using its 1.35-cycle hit time and 0.019 miss rate.
B.1 Introduction 61
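A C sketch of my own that finishes the comparison. How the 1.35-cycle hit time of the 2-way cache is charged is an assumption here (each memory reference pays the extra hit-time cycles on top of the base CPI); other treatments stretch the clock cycle instead, so take the exact numbers as illustrative:

#include <stdio.h>

static double cpu_time_per_inst(double base_cpi, double refs_per_inst,
                                double hit_time, double miss_rate,
                                double miss_penalty, double cycle_ns) {
    /* extra hit cycles beyond the 1 cycle already in the base CPI, plus miss stalls */
    double stalls = refs_per_inst * ((hit_time - 1.0) + miss_rate * miss_penalty);
    return (base_cpi + stalls) * cycle_ns;
}

int main(void) {
    double dm   = cpu_time_per_inst(1.0, 1.4, 1.00, 0.021, 150.0, 1.0);  /* ~5.41 ns */
    double way2 = cpu_time_per_inst(1.0, 1.4, 1.35, 0.019, 150.0, 1.0);  /* ~5.48 ns */
    printf("DM = %.2f ns/inst, 2-way = %.2f ns/inst\n", dm, way2);
    return 0;
}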
Cache Performance – Exercise
Base CPI = 1,
cycle time = 1ns
Hit time(DM) = 1 cycle
Miss rate (DM) = 0.021
Hit time (2-way) = 1.35 cycles
Miss rate (2-way) = 0.019
Miss penalty = 200 cycles
Avg mem. req/inst = 1.4
B.1 Introduction 62
Basic Cache Optimizations
Cache Performance – Review
B.1 Introduction 2
Six Basic Cache Optimizations
➺ Larger block size
➺ Reduces compulsory misses
➺ Increases capacity and conflict misses, increases miss penalty
➺ Larger total cache capacity to reduce miss rate
➺ Increases hit time, increases power consumption
➺ Higher associativity
➺ Reduces conflict misses
➺ Increases hit time, increases power consumption
➺ Multi-level cache to reduce miss penalty
➺ Reduces overall memory access time
➺ Giving priority to read misses over writes
➺ Reduces miss penalty
➺ Avoiding address translation in cache indexing
➺ Reduces hit time
B.3 Six Basic Cache Optimizations 5
Optimization 1 – Block Size Selection
➺ Determined by the bandwidth (bytes/s) of the lower-level memory: bus bandwidth, memory bandwidth
[Figure: CPU – Cache – Memory – Hard Disk hierarchy]
Example (two-level cache):
  HT(L1) = 1 cycle
  HT(L2) = 10 cycles
  MP(L2) = 200 cycles
  Avg. memory accesses/inst = 1.5
  Q2: AMAT?
  Q3: Avg. stall cycles/inst?
20
Basic Ideas
➺ Reduce hit time, miss rate, and miss penalty
➺ We have seen some optimizations
[Figure: way prediction in a 2-way set-associative cache: a 2-bit predictor selects which way's data to use while the tags are compared; if the tag comparison mis-matches, the prediction is wrong]
Figure 2.10 Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these addresses would
be multiplied by 64 to get byte addressing.
➺ Early restart
➺ Fetch the block in normal order
➺ Send the missed word to the processor as soon as it arrives
[Figure: write-buffer contents without and with write merging]
/* before: the inner loop strides through x column-by-column (16-word stride) */
for (j = 0; j < 16; j++)
    for (i = 0; i < 16; i++)
        x[i][j] = 2*x[i][j];

/* after: loop interchange makes the accesses sequential in row-major order */
for (i = 0; i < 16; i++)
    for (j = 0; j < 16; j++)
        x[i][j] = 2*x[i][j];

Row-major memory layout: x[0][0], x[0][1], ..., x[0][15], x[1][0], ...
➺ Associative cache
➺ longer hit time, lower miss rate, higher power consumption
➺ Performance evaluation
➺ AMAT (average memory access time): may be misleading
➺ CPI with stall cycles due to cache misses: more accurate
[Figure: transfer units in the hierarchy: CPU ↔ Cache (word), Cache/L2 ↔ Memory (block), Memory ↔ Hard Disk (page)]
[Figure: the OS and applications App1, App2 each occupy separate regions of memory and disk]
➺ Provide protection
➺ One process can’t interfere with another
➤ Because they operate in different address spaces
➺ User process cannot access privileged information
➤ Different sections of address space have different permissions
B.4 Virtual Memory 6
A Simplified View of Virtual Memory
[Figure: the CPU issues virtual addresses 0 .. N-1; a page table maps each virtual page to a physical address 0 .. P-1 in memory, or to a location on disk]
B.4 Virtual Memory 7
Cache vs Virtual Memory
From the CPU: virtual address [n-1..0] = virtual page number | page offset (bits p-1..0)
        ↓ address translation
To memory: physical address [m-1..0] = physical page number | page offset (bits p-1..0)
B.4 Virtual Memory 11
Address Translation
Virtual address [n-1..0]: virtual page number (VPN) | page offset (bits p-1..0)
Physical address [m-1..0]: physical page number (PPN) | page offset (bits p-1..0)
[Figure: the page table base register plus the VPN select a page table entry; its PPN is concatenated with the page offset to form the physical address]
14
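A C sketch of my own of the translation step; the flat page table and 4 KiB page size are assumptions for illustration, and real page table entries also hold valid and protection bits:

#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET_BITS 12   /* assume 4 KiB pages */

static uint64_t translate(uint64_t vaddr, const uint64_t *page_table /* VPN -> PPN */) {
    uint64_t vpn    = vaddr >> PAGE_OFFSET_BITS;
    uint64_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);
    return (page_table[vpn] << PAGE_OFFSET_BITS) | offset;
}

int main(void) {
    uint64_t pt[16] = {0};
    pt[3] = 42;                                   /* map VPN 3 to PPN 42 */
    uint64_t va = (3ull << PAGE_OFFSET_BITS) | 0x2a4;
    printf("VA 0x%llx -> PA 0x%llx\n",
           (unsigned long long)va, (unsigned long long)translate(va, pt));
    return 0;
}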
Page Faults
➺ What if object is on disk rather than in memory?
➺ Page table entry indicates virtual address not in memory
➺ OS exception handler invoked to move data from disk into
memory
➺ Current process suspends, others can resume
➤ OS has full control over placement, etc.
B.4 Virtual Memory 15
Servicing Page Faults
➺ (1) Initiate block read
➺ The processor signals the I/O controller: read a block of length P starting at disk address X and store it starting at memory address Y
➺ (2) Read occurs
➺ Direct Memory Access (DMA), under control of the I/O controller
➺ (3) Read done
➺ The I/O controller signals completion: interrupt the processor
➺ OS resumes the suspended process
[Figure: processor, cache, memory, and I/O controller on a memory–I/O bus, with disks attached to the controller]
Figure B.24 Operation of the Opteron data TLB during address translation. The four steps of a TLB hit are shown
as circled numbers. This TLB has 40 entries. Section B.5 describes the various protection and access fields of an
Opteron page table entry.
21
Multi-Level Page Table
Virtual address
22
Selecting Page Size – Large Page Size
➺ Smaller page table
➺ A page table can occupy a lot of space
➺ Larger L1 cache
➺ Fewer TLB misses, faster translation
➺ More efficient to transfer larger pages from secondary storage
➺ Drawbacks
➺ Wasted memory (internal fragmentation)
➺ Wasted I/O bandwidth
➺ Slower process start-up
➺ Role of architecture:
➺ Provide user mode and supervisor mode
➺ Protect certain aspects of CPU state: PC, registers, etc.
➺ Provide mechanisms for switching between user mode and
supervisor mode
➺ Provide mechanisms to limit memory accesses
➺ Provide TLB to translate addresses
25
Page Table Entries – Protection
➺ Check bits on each access and during a page fault
➺ If violated, generate exception (Access Protection exception)
26
Summary
➺ OS virtualizes memory and IO devices
➺ Each process has an illusion of private CPU and
memory
➺ Virtual memory
➺ Arbitrarily large memory, isolation/protection, inter-
process communication
➺ Reduce page table size
➺ Translation buffers – cache for page table
➺ Manage TLB misses and page faults
➺ Protection
27