Computer Architecture Slides

EEL 6764 Principles of Computer Architecture, taught by Dr. Robert Karam at the University of South Florida, covers the fundamentals of computer architecture, including the interconnection of hardware components to meet performance and cost goals. The course explores various levels of abstraction in computing systems, the impact of technology trends on performance, and the principles of computer design. Key topics include power consumption, parallelism, and performance metrics, with a focus on the evolution of computer architectures and their applications.


EEL 6764 Principles of Computer Architecture

Instructor: Dr. Robert Karam


Dept. of Computer Science & Engineering
University of South Florida
Tampa, FL 33620
Email: rkaram@usf.edu

Acknowledgment
Adapted from Prof. Zheng & Prof. Katkoori’s slides
1
Ubiquitous Usage of Computers
➺ General purpose computing: ~2%
➺ Embedded applications: ~98%

➺ Today’s high-end computers have
➺ Tens of billions of transistors in a chip
➺ Tera-FLOPS (10^12 floating-point operations per second) of processing power
2
Computer Applications
(Applications arranged along a power vs. performance spectrum)
➺ Ultralow power applications: medical, space, specific sensor networks, etc.
➺ Portable applications: mobile computing, wireless, multimedia, etc.
➺ General purpose computing: internet servers, database servers, real-time jobs, etc.

Different applications have different power-performance demands


3
What is Computer Architecture?

“Computer architecture is the science and art of


selecting and interconnecting hardware components to
create a computer that meets functional, performance,
and cost goals.” *
* G. Sohi, UW

Think about architecture of buildings!

4
What is Computer Architecture?

5
Abstractions in Modern Computing Systems

ISA – the user manual for the programmer


• Defines the interface between the hardware and software
• Defines how the processor is controlled by the software
• Defines what the processor is capable of doing.

uArch – implements the ISA


• Defines the particular implementation
• Defines major structures, e.g. ALU, regfile, interconnects

RTL – behavior of the circuits


• Defines how the major structures function
• Defines how data move between registers and how
data are transformed in the process
• E.g. Behavioral Verilog

6
How Do the Pieces Fit Together?
Application: algorithms, programming
Operating System: operating system
Compiler: firmware, compilers
Instruction Set Architecture: memory system, instruction set processor, I/O system  <- this course
Datapath & Control: CSD
Digital Design: logic design
Circuit Design: CMOS VLSI, low power

➺ Coordination of many levels of abstraction


➺ Under a rapidly changing set of forces
➺ Design, measurement, and evaluation
7
Abstraction Example
➺ Application – encryption
➺ Algorithm – AES
➺ Programming Language – implement in C
➺ OS/VM – generate Windows binary (exe)
➺ ISA – targeting x86-64 instruction set
➺ uArch – Intel Skylake (with AES-NI, AES-specific instructions)
➺ RTL – describes the behavior of AES HW
➺ Gates – NAND/NOR/XOR gates implementing RTL
➺ Circuits – physical implementation of gates; transistors, interconnects
➺ Devices – what kind of transistor? (14 nm FinFET)
➺ Physics – fundamental principles underlying transistor/circuit operation

8
A Processor Example

DRAM BANKS
➺ The objective of this course is to study most of
these components
9
Another CPU – Intel Core i7 (2008)

10
A System-Level View
(Block diagram: Processor, Graphics, Memory, I/O Bus, Processor Interface, Disk/USB Interface)

11
System Design Parameters
➺ Performance (Speed)
➺ Cost
➺ Power (static + dynamic)
➺ Peak power
➺ Average power
➺ Robustness
➺ Noise-tolerance
➺ Radiation-hardness
➺ Testability
➺ Reconfigurability
➺ Time-to-market etc.
12
Classes of Computers

➺ Shipments: 1.9B PMD, 350M Desktop, 19B Embedded.


➺ Desktop has largest $ market share.

1.2 Classes of Computers 13


Single Processor Performance
Move to multi-processor

RISC

1.1 Introduction
Computer Technology Driving Forces
➺ Improvements in semiconductor technology
➺ Feature size, clock speed, cost
➺ Improvements in computer architectures
➺ Enabled by high-level language compilers, UNIX
➺ Led to RISC architectures

➺ Together have enabled:


➺ More powerful and efficient computers.
➺ New classes of computers, e.g. mobile devices, etc.
➺ Penetration of GP CPUs into many applications.
➺ Tradeoff between performance and productivity in
SW development.
1.1 Introduction 15
Current Trends in Architecture
➺ Hurdles
➺ Power Wall – limits on power consumption
➺ Memory Wall – limits on memory access speed
➺ Lack of Instruction-Level parallelism (ILP) to exploit
➺ Single processor performance improvement ended in
2003
➺ New opportunities for improving performance:
➺ Data-level parallelism (DLP)
➺ Thread-level parallelism (TLP)
➺ Request-level parallelism (RLP)

➺ These require explicit restructuring of the applications


➺ Applications must expose parallelism explicitly.
1.1 Introduction 16
Classes of Parallelism
➺ Exploitation of parallelism --> performance

➺ Classes of parallelism in applications:


➺ Data-Level Parallelism (DLP)
➺ Task-Level Parallelism (TLP)

➺ Classes of architectural parallelism:


➺ Instruction-Level Parallelism (ILP)
➺ Vector architectures/Graphic Processor Units (GPUs)
➺ Thread-Level Parallelism (TLP)
➺ Request-Level Parallelism (RLP)

1.2 Classes of Computers 17


Flynn’s Taxonomy
Flynn's Taxonomy is a classification of computer architectures based on the number of instruction streams and data
streams they can process.
➺ Single instruction stream, single data stream (SISD)
➺ Exploits ILP and TLP to some degree

➺ Single instruction stream, multiple data streams (SIMD)


➺ Targets DLP
➺ Vector architectures
➺ Multimedia extensions
➺ Graphics processor units

➺ Multiple instruction streams, single data stream (MISD)


➺ No commercial implementation

➺ Multiple instruction streams, multiple data streams (MIMD)


➺ Targets TLP and RLP
➺ Tightly-coupled MIMD - TLP
➺ Loosely-coupled MIMD - RLP

       SD     MD
SI    SISD   SIMD
MI     X     MIMD
1.2 Classes of Computers 18
Technology Trends

19
Technology Trends
➺ Integrated circuit technology
➺ Transistor density: +35%/year
➺ Die size: +10-20%/year
➺ Integration overall: +40-55%/year
(Important to design for the next generation of technology!)
➺ DRAM capacity: +25-40%/year (slowing)
➺ Foundation of main memory.
➺ Flash capacity: +50-60%/year
➺ 8-10X cheaper/bit than DRAM
➺ An order of magnitude slower than DRAM
➺ Magnetic disk technology: + ~50%/yr
➺ 15-25X cheaper/bit than Flash
➺ 300-500X cheaper/bit than DRAM
➺ Main storage for server or WSC.
(Current technologies approaching their limits; new technologies being researched)

1.4 Trends in Technology 20


First Microprocessor

1.4 Trends in Technology 21


Height of Single-Core Processor

1.4 Trends in Technology 22


MultiCore Processor

1.6 billion transistors, released in 2013.


1.4 Trends in Technology 23
Transistors and Wires
➺ Transistor feature size
➺ Minimum size of transistor or wire in x or y dimension
➺ 10 microns in 1971 to 0.032 microns in 2011
➺ Now in 2020/2023, seeing 7nm – 3nm.

➺ Transistor density grows exponentially.


➺ Moore’s law – has been slowing down

➺ Transistor performance scales linearly


➺ Wire delay does not improve with feature size!
➺ In fact, it is getting worse!
➺ Makes on-chip interconnect design an important task!

1.4 Trends in Technology 24


How Many Transistors?

Intel Nehalem (11/2008):
o Gen. 1 Core i7, quad-core CPU
o 45 nm technology
o 296 mm2 die area
o 781,000,000 transistors
o 2.6M transistors / mm2

Nvidia AD102 (10/2022):
o GeForce RTX 4090 [thousands of cores]
o 4 nm technology
o 608 mm2 die area
o 76,300,000,000 transistors
o 125.5M transistors / mm2

25
Performance Measures
➺ Bandwidth or throughput
➺ Total work done in a given time
➺ Important for servers and data center operators

➺ Latency or response time


➺ Time between start and completion of an event
➺ Important for individual users

1.4 Trends in Technology 26


Bandwidth and Latency
o Bandwidth
→ 10,000-25,000X
improvement for
processors
→ 300-1200X improvement
for memory and disks

o Latency
→ 30-80X improvement for
processors
→ 6-8X improvement for
memory and disks
o Improvement in
Bandwidth = square of
improvement in latency
Log-log plot of bandwidth and
latency milestones
1.4 Trends in Technology
Power and Energy
➺ Problem: Get power in, distribute it, get it out

➺ Maximal power consumption

➺ Thermal Design Power (TDP)


➺ Characterizes sustained power consumption
➺ Used as target for power supply and cooling system
➺ Lower than peak power, higher than average power
consumption

➺ System energy efficiency


➺ Often a better measurement of overall system efficiency
➺ Power = a design constraint

1.5 Trends in Power and Energy 28


Dynamic Energy and Power
➺ Dynamic energy
➺ Transistor switch from 0 -> 1 or 1 -> 0

➺ Dynamic power

➺ Relation between energy and power consumption

➺ P: power, E: energy, f: switching frequency

➺ Reducing clock frequency reduces power, not energy
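For reference, the standard relations behind these bullets (in the form used by H&P):
Energy_dynamic ∝ 1/2 × Capacitive load × Voltage^2   (per 0→1 or 1→0 transition)
Power_dynamic ∝ 1/2 × Capacitive load × Voltage^2 × Frequency switched
So P ≈ E × f for a given switching activity: lowering f lowers power, but the energy needed to complete the task is unchanged.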


1.5 Trends in Power and Energy 29
Power and Energy – Example
➺ Suppose two designs have the same clock frequency.
➺ Design 1 consumes, on average, 20% less energy per
operation than Design 2.
➺ Design 1’s power is 20% less than design 2 if they run
at the same speed
➺ Design 1 dissipates less heat with the same
performance of design 2
➺ Or, design 1’s clock frequency can be improved by
25% while consuming the same power as design 2.
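A quick check of that arithmetic using P = E_per_op × f: at the same frequency, Power_1 / Power_2 = E_1 / E_2 = 0.8, i.e. 20% less power; at equal power, the operation rate (and thus the clock frequency) can rise by 1/0.8 = 1.25, i.e. 25%.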

1.5 Trends in Power and Energy 30


Power & Clock Frequency
➺ Intel 80386
consumed ~2 W
➺ 3.3 GHz Intel Core i7
consumes 130 W
➺ Heat must be
dissipated from 1.5 x
1.5 cm chip
➺ This is the limit of
what can be cooled
by air.
➺ Performance
improvement by
increasing f is over!

1.5 Trends in Power and Energy 31


Reducing Power
➺ Techniques for reducing power:
➺ Reduce clock frequency
➺ Turn off components not being used – dark silicon
➺ Dynamic Voltage-Frequency Scaling
➺ Low power state for DRAM, disks
➺ Overclocking, turning off all but one core

Energy efficiency has become a critical quality metric


• Measured by tasks/joule or performance/watt

1.5 Trends in Power and Energy 32


Static Power
➺ Transistors are not perfect switches
➺ leakage current even when no switching

➺ Static power consumption


➺ Current_static × Voltage
➺ Scales with number of transistors
➺ To reduce: power gating

Leakage Power Trends (2008)

1.5 Trends in Power and Energy 33


Where are Energy/Area Consumed?

34
Factors Affecting Cost
➺ Cost: design, manufacturing, testing, material, etc.
➺ Design – one time cost
➺ Manufacturing ... – recurring cost

➺ Cost driven down by learning curve


➺ Yield: the fraction of manufactured products that are good.

➺ Increase in volume --> reduced unit cost


● 10% less for each doubling of volume
● Design cost amortized over volume
● Material cost lower on large volume

➺ Operating cost: significant for data center

1.6 Trends in Cost 35


Integrated Circuit: Wafer and Die

die
wafer

1.6 Trends in Cost 36


Intel Core i7 (2nd gen.) Wafer
➺ 300mm wafer, 280 chips, 32nm technology
➺ Each chip is 20.7 x 10.5 mm

37
Cost of Integrated Circuits
➺ Cost of Integrated circuit

➺ Bose-Einstein formula:

➺ Defects per unit area = 0.016-0.057 defects per square cm


(2010)
➺ N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
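For reference, the cost model these parameters feed (as given in H&P):
Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
Cost of die = Cost of wafer / (Dies per wafer × Die yield)
Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N   (the Bose-Einstein formula)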

1.6 Trends in Cost 38


Manufacturing Process

39
Principles of Computer Design

40
Measuring Performance
➺ Typical performance metrics:
➺ Response (execution) time & throughput
➺ Speedup of X relative to Y
➺ Execution time_Y / Execution time_X
➺ Execution time
➺ Wall clock time: includes all system overheads
➺ CPU time: only computation time
➺ Benchmarks
➺ Kernels (e.g. matrix multiply)
➺ Toy programs (e.g. sorting)
➺ Synthetic benchmarks (e.g. Dhrystone)
(the three above can be misleading)
➺ Benchmark suites (e.g. SPEC06fp, TPC-C)

1.8 Measuring Performance 41


Principles of Computer Design
➺ Take Advantage of Parallelism
➺ e.g. multiple processors, disks, memory banks, pipelining,
multiple functional units
➺ Principle of Locality
➺ A property of programs: data and instructions are typically reused

➺ Focus on the Common Case – Amdahl’s Law
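For reference, Amdahl’s Law in the form used throughout the course:
Speedup_overall = Execution time_old / Execution time_new
                = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)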

1.9 Quantitative Principles of Computer Design 42


Amdahl’s Law - Example
Suppose a new processor is 10 times faster in computation than
the current one. Assume that the current processor is 40% time
busy with computation and 60% of the time idle waiting for IO.
What is the overall speedup with the new processor?
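Worked solution using Amdahl’s Law: the enhanced (computation) fraction is 0.4 and its speedup is 10, so
Speedup_overall = 1 / ((1 − 0.4) + 0.4/10) = 1 / 0.64 ≈ 1.56.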

1.9 Quantitative Principles of Computer Design 43


The Processor Performance Equation

CPU time = Instruction count × CPI × Clock cycle time
➺ Instruction count: influenced by the ISA and the compiler
➺ CPI: influenced by the architecture (organization) and the ISA
➺ Clock cycle time: influenced by the implementation technology and the architecture
1.9 Quantitative Principles of Computer Design 44
The Processor Performance Equation
➺ Different instruction types have different CPIs
➺ CPU clock cycles = Σ_i (IC_i × CPI_i), so CPU time = (Σ_i IC_i × CPI_i) × Clock cycle time
➺ IC_i : count of instruction type i in a program
➺ CPI_i : average cycle count for instruction type i

1.9 Quantitative Principles of Computer Design 45


Suppose the following measurements
fraction of FP ops = 25%
Average CPI of FP ops = 4 cycles
fraction of FPSQR = 2%
Average CPI of FPSQR = 20 cycles
Average CPI of other inst = 1 cycles

Design #1: decrease CPI of FPSQR to 2


Design #2: decrease CPI of FP ops to 2
Compare these two design alternatives.
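One way to work this out (reading the 2% FPSQR as part of the 25% FP operations):
CPI_original = 0.25 × 4 + 0.75 × 1 = 1.75
Design #1 (FPSQR CPI 20 → 2): CPI = 1.75 − 0.02 × (20 − 2) = 1.39
Design #2 (all FP CPI → 2):   CPI = 0.25 × 2 + 0.75 × 1 = 1.25
With IC and clock unchanged, Design #2 wins: speedup 1.75/1.25 = 1.4× versus 1.75/1.39 ≈ 1.26×.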

1.9 Quantitative Principles of Computer Design 46


Summary
➺ Computer architecture involves ISA, microarchitecture,
HW technologies.
➺ Is about making tradeoffs among design parameters optimized
for target applications.
➺ Should be mindful about technology trends, and their impacts
on comp. design.
➺ Classification of comp. arch. wrt different types of
parallelism they exploit.
➺ Reviews of trends of various system parameters.
➺ Overviews of computer design principles
➺ Exploitation of parallelism, locality, optimization for common
cases, etc.
➺ Processor execution time evaluation
47
Backup

48
Where is the Market?
(Bar chart, millions of computers shipped 1998–2002: embedded systems grew from roughly 290M to 1122M units per year, versus roughly 93–135M desktops and only 3–5M servers)
➺ It is estimated that there will be between 15 and 20 billion devices (small embedded devices) with a $900 Billion USD market, which is growing twice as fast as the PC market.

➺ Embedded applications: mobile electronics, automobile


electronics, home electronics, communications and networking,
healthcare products, sensor networks, gaming hardware, defense
applications and many others!
49
Computer Architecture – Past and Now
➺ Computer architecture – Past
➺ Instruction Set Architecture (ISA) design
➺ i.e. decisions regarding:
➤ registers, memory addressing, addressing modes, instruction operands,
available operations, control flow instructions, instruction encoding

➺ Computer architecture – Now


➺ ISA design is less of a focus
➺ Meet specific requirements of the target machine
➺ Design to find a best tradeoff among performance, cost,
power, and availability, etc, optimized for target applications
➺ Consider ISA, microarchitecture, logic/circuit design,
implementation, etc.
Impacts from requirements of applications and technology.
1.3 Defining Computer Architecture 50
Dependability
➺ As feature size shrinks, computers fail more often.

➺ Module reliability
➺ Mean time to failure (MTTF)
➺ Mean time to repair (MTTR)
➺ Mean time between failures (MTBF) = MTTF + MTTR
➺ Availability = MTTF / MTBF
➺ = a ratio between service time and total life span

➺ To improve reliability: redundancy.
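Illustrative numbers (not from the slides): with MTTF = 1,000,000 hours and MTTR = 24 hours, MTBF = 1,000,024 hours and Availability = 1,000,000 / 1,000,024 ≈ 99.998%.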

1.7 Dependability 51
Instruction Set Principles

1
Objective and Reading

➺ Objective
➺ understand issues and tradeoffs in instruction set design

➺ Reading
➺ Computer Architecture: A Quantitative Approach
➤ Appendix A
➺ recommended: Computer Organization and Design: The
Hardware/Software Interface
➤ Chapter 2

2
Abstractions
➺ Instruction set architecture (ISA)
➺ The hardware/software interface
➺ Defines storage, operations, etc
➺ Abstraction helps to deal with complexity
➺ Hide lower-level detail
➺ Implementation
➺ The details underlying the interface
➺ An ISA can have multiple implementations
(Layers of abstraction: Problem, Algorithm, Program/Language, Runtime System (VM, OS, MM), ISA, Microarchitecture, Logic, Circuits, Electrons; the user sits at the top, and the ISA is the interface in the middle)
A.1 Introduction 3
Levels of Program Code
➺ High-level language
➺ Level of abstraction closer to problem domain
➺ Provides for productivity and portability
➺ Assembly language
➺ Textual representation of instructions
➺ Hardware representation
➺ Machine code - Binary digits (bits)
➺ Encoded instructions and data

➺ Compiler
➺ Translate HL prgm to assembly
➺ Assembler
➺ Translate assembly to machine code

A.1 Introduction 4
Program Execution Model
➺ A computer is just an FSM
➺ States stored in registers, memory, PC, etc
➺ States changed by instruction execution
➺ An instruction is executed in:
➺ Fetch an instruction into CPU from memory
➺ Decode it to generate control signals
➺ Execute it (add, mult, etc)
➺ Write back output to reg or memory
➺ Programs and data coexist in memory
➺ How to distinguish program from data?
(Figure: microprocessor and memory; Fetch → Decode → Execute → Writeback)

A.1 Introduction 5
What Makes a Good ISA?
➺ Programmability
➺ Who does assembly programming these days?

➺ Performance/Implementability
➺ Easy to design high-performance implementations?
➺ Easy to design low-power/energy implementations?
➺ Easy to design low-cost implementations?

➺ Compatibility
➺ Easy to maintain as languages, programs evolve
➺ x86 (IA32) generations: 8086, 286, 386, 486, Pentium,
Pentium-II, Pentium-III, Pentium4, Core2, Core i7, ...

A.1 Introduction 6
Performance
➺ Execution time = IC * CPI * cycle time
➺ IC: instructions executed to finish a program
➺ Determined by program, compiler
➺ CPI: number of cycles needed for each
instruction
➺ Determined by compiler, micro-architecture
➺ Cycle time: inverse of clock frequency
➺ Determined by micro-arch. & technology
➺ Ideally optimize all three
➺ Their optimizations often against each other
➺ Compiler plays a significant role in reducing IC
A.1 Introduction 7
Instruction Granularity

opcode operand 1 operand 2 … operand n

o CISC (Complex Instruction Set Computing) ISAs


→ Big heavyweight instructions (lots of work per inst)
+ Low “inst/program” (IC)
– Higher “cycles/inst” (CPI) & “seconds/cycle” (Cycle T)

o RISC (Reduced Instruction Set Computer) ISAs


→ Minimalist approach to an ISA: simple inst only
+ Low CPI and “seconds/cycle” (Cycle T)
– Higher IC, but hopefully not as much
➤ Rely on compiler optimizations
A.1 Introduction 8
Classifying Architectures
➺ One important classification scheme is by the type of
addressing modes supported.
➺ Stack architecture: Operands implicitly on top of a stack.
(Early machines, Intel floating-point.)
➺ Accumulator architecture: One operand is implicitly an
accumulator (a special register).
➺ early machines
➺ General-purpose register arch.: Operands may be any of
a large (typically 10s-100s) # of registers.
➺ Register-memory architectures: One op may be memory.
➺ Load-store architectures: All ops are registers, except in
special load and store instructions.
A.2 ISA Classification 9
Illustrating Architecture Types
Assembly for C:=A+B:

A.2 ISA Classification 10


Number of Registers
➺ Registers have advantages
➺ faster than memory, good for compiler optimization, hold variables
➺ So have as many registers as possible?
➺ No – why?
➺ One reason that registers are faster:
➺ There are fewer of them – small is fast (hardware truism)
➺ Another reason: they are directly addressed
➺ More registers, means more bits per register in instruction
➺ Thus, fewer registers per instruction or larger instructions
➺ More registers means more saving/restoring
➺ Across function calls, traps, and context switches
➺ Trend toward more registers:
➺ 8 (x86) → 16 (x86-64), 16 (ARM v7) → 32 (ARM v8)
A.2 ISA Classification 11
Number of Operands
➺ A further classification is by the max. number of
operands, and # of operands that can be memory: e.g.,
➺ 2-operand (e.g. a += b)
➤ src/dest(reg), src(reg)
➤ src/dest(reg), src(mem) IBM 360, x86, 68k
➤ src/dest(mem), src(mem) VAX
➺ 3-operand (e.g. a = b+c)
➤ dest(reg), src1(reg), src2(reg) MIPS, PPC, SPARC, RISC-V.
➤ dest(reg), src1(reg), src2(mem)
➤ dest(mem), src1(mem), src2(mem) VAX

➺ Types of General Purpose Register Computers


➺ register-register (load-store)
➺ register-memory
➺ memory-memory
A.2 ISA Classification 12
Memory Addressing
➺ Byte Addressing
➺ Each byte has a unique address
➺ Other addressing units
➺ Half-word: 16-bit (or 2 bytes)
➺ Word: 32-bit (or 4 bytes)
➺ Double word : 64-bit (or 8 bytes)
➺ Quad word: 128-bit (or 16 bytes)
➺ Two issues
➺ Alignment specifies whether there are any boundaries
for word addressing
➺ Byte order (Big Endian vs. Little Endian)
➤ specifies how multiple bytes within a word are mapped to
memory addresses
A.3 Memory Addressing 13
Memory Addressing
➺ Alignment
➺ Half word, words, double words begin at mod 2, mod 4, mod 8
boundaries

half word
#mod 2

Word
#mod 4

Double
word
#mod 8
Aligned if Addr mod size = 0
A.3 Memory Addressing 14
Memory Addressing
➺ Alignment
➺ Or there are no alignment restrictions

half word

Word

Double
word

A.3 Memory Addressing 15


Memory Addressing
➺ Non-aligned memory references may cause multiple
memory accesses

➺ Consider a system in which memory reads return 4


bytes and a reference to a word spans a 4-byte
boundary: two memory accesses are required
➺ Complicates memory and cache controller design
➺ Assemblers typically force alignment for efficiency
A.3 Memory Addressing 16
Addressing Modes – How to Find Operands

A.3 Memory Addressing


Addressing Modes
➺ Addressing modes can reduce IC but
➺ At a cost of added CPU design complexity and/or
➺ Increase average CPI

➺ Example – memory indirect


Add R1, @(R3)
➺ equivalently,
load R2, Mem[R3]
load R4, Mem[R2]
Add R1, R1, R4

A.3 Memory Addressing 21


Which Addressing Modes to Support
➺ Support frequently used modes
➺ Make common case fast!

A.3 Memory Addressing 22


Displacement Value Distribution

add R4 100(R1) – 16 bits to be sufficient


SPEC CPU2000 on Alpha
A.3 Memory Addressing 23
Popularity of Immediate Operands

add R4 #3
SPEC CPU2000 on Alpha

A.3 Memory Addressing 24


Distribution of Immediate Values

add R4 #3 – 16 bits to be sufficient


SPEC CPU2000 on Alpha
A.3 Memory Addressing 25
Other Issues

➺ How to specify type and size of operands (A.4)


➺ Mainly specified in opcode – no separate tags for
operands
➺ Operations to support (A.5)
➺ simple instructions are used the most

➺ Control flow instructions (A.6)


➺ Branch, call/return more popular than jump
➺ Target address is typically PC-relative & register indirect
➺ Address displacements are usually <= 12 bits
➺ How to implement conditions for branches

26
Instruction Encoding
➺ Affects code size and implementation efficiency

➺ OpCode – Operation Code


➺ The instruction (e.g., “add”, “load”)
➺ Possible variants (e.g., “load byte”, “load word”…)

➺ Operands – source and destination


➺ Register, memory address, immediate

➺ Addressing Modes – Impacts code size


1. Encode as part of opcode (common in load-store architectures
which use a small number of addressing modes)
2. Address specifier for each operand (common in architectures which
support many different addressing modes)

A.7 Instruction Encodings 27


Instruction Encoding

length: 1 – 17 bytes

A.7 Instruction Encodings 28


Fixed vs Variable Length Encoding
➺ Fixed Length
➺ Simple, easily decoded
➺ Larger code size

➺ Variable Length
➺ More complex, harder to decode
➺ More compact, efficient use of memory
➤ Fewer memory references
➤ Advantage possibly mitigated by use of cache
➺ Complex pipeline: instructions vary greatly in both size and
amount of work to be performed

A.7 Instruction Encodings 29


Instruction Encoding
➺ Tradeoff between variable and fixed encoding is size of
program versus ease of decoding
➺ Must balance the following competing requirements:
➺ Support as many registers and addressing modes as possible
➺ Impact of the number of registers and addressing mode fields
on the average instruction size
➺ Desire to have instructions encoded into lengths that will be
easy to handle in a pipelined implementation
➤ Multiples of bytes rather than an arbitrary # of bits

➺ Many desktop and server processors choose fixed-length
instructions
➺ Why?
A.7 Instruction Encodings 30
Putting it Together
➺ Use general-purpose registers with load-store arch
➺ Addressing modes: displacement, immediate, register
indirect
➺ Data size: 8-, 16-, 32-, and 64-bit integer, 64-bit floating point
➺ Simple instructions: load, store, add, subtract, …
➺ Compare: =, ≠, <
➺ Fixed-length instructions for performance, variable-length instructions
for code size
➺ At least 16 registers

➺ Read section A9 to get an idea of RISC-V ISA.


➺ Useful for understanding following discussions on pipelining
31
Pitfalls

➺ Designing “high-level” instruction set features to


support a high-level language structure
➺ They do not match HL needs, or
➺ Too expensive to use
➺ Should provide primitives for compiler

➺ Innovating at instruction set architecture alone


without accounting for compiler support
➺ Often compiler can lead to larger improvement in
performance or code size

32
Compilers and Optimization

➺ Providing primitives to the compiler enables it to


work more efficiently
➺ Translate instructions into a Control and Data Flow
Graph (CDFG) and use graph analysis algorithms
to find potential optimizations
➺ Nodes (vertices) are “basic blocks”
➺ Directed edges are jumps in control flow

(Figures: control-flow graphs for an if/else and a while loop) 33


Basic Blocks

➺ Sequence of instructions with


➺ No embedded branches (except at the end)
➺ No branch targets (except at the beginning)
➺ Identified by the compiler as a potential target
for optimization
➺ Advanced CPUs can accelerate execution of basic
blocks in hardware

34
Example Optimizations
➺ Eliminate unreachable expressions
➺ Easy to find in the graph – no edges in
➺ No performance impact / reduces code size
➺ Find and identify common expressions
➺ E.g. a mathematical expression that is computed multiple
times
➺ Automatically remove subsequent computations if no
variables have changed

35
Example Optimizations (cont.)
➺ Simplify algebraic expressions
➺ X = 3*4 + y → X = 12 + y [“constant propagation”]
➺ X = Y; Z = X + 5; → Z = Y + 5; [“variable propagation”]
➺ X = Y*8 → X = Y << 3 [“strength reduction”]
➺ Optimize loops (see the sketch after this slide)
➺ Move loop-invariant instructions outside of the loop
➺ Merge loops whenever possible
➺ Unroll loops if it is cheaper than tracking an iterator

36
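To make the loop-related optimizations above concrete, here is a small hypothetical C sketch (the function and variable names are invented for illustration), showing a loop before and after constant propagation, strength reduction, and hoisting of loop-invariant code:

/* Before optimization: */
int sum_scaled(const int *x, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int scale = 3 * 4;          /* constant expression recomputed every iteration */
        sum += x[i] * scale * 8;    /* multiply by 8 and by a loop-invariant factor    */
    }
    return sum;
}

/* After constant propagation (3*4 -> 12), strength reduction (*8 -> <<3),
   and moving the loop-invariant computation out of the loop: */
int sum_scaled_opt(const int *x, int n) {
    int sum = 0;
    const int scale = 12 << 3;      /* computed once, outside the loop */
    for (int i = 0; i < n; i++) {
        sum += x[i] * scale;
    }
    return sum;
}

Both versions compute the same result; the second simply does less work per iteration.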
Effect of Compiler Optimizations
Compiled with gcc for Pentium 4 under Linux

(Four bar charts comparing gcc optimization levels none / O1 / O2 / O3: relative performance, instruction count, clock cycles, and CPI)

(From P&H Ch. 2)


37
A.9 RISC-V Architecture

38
RISC Philosophy

➺ Keep the instruction set small and simple, in


order to build fast hardware

➺ Let software do complicated operations by


composing simpler ones
➺ give compiler freedom to optimize

A.9 RISC-V Architecture


What is RISC-V?
➺ Fifth generation of RISC ISA from UC Berkeley
➺ A high-quality, license-free, royalty-free RISC ISA
specification
➺ Experiencing rapid uptake in both industry and academia
➺ Both proprietary and open-source core implementations
➺ Supported by growing shared software ecosystem
➺ Appropriate for all levels of computing system, from
microcontrollers to supercomputers
➺ 32-bit, 64-bit, and 128-bit variants
➺ Standard maintained by non-profit RISC-V Foundation

A.9 RISC-V Architecture 40


Example Toolchain Output

riscv64-unknown-elf-gcc -O0 -S test.c

41
Example Toolchain Output

riscv64-unknown-elf-gcc -O0 -march=rv64imafd test.c -o test.o

riscv64-unknown-elf-objdump -d test.o

42
What is RISC-V?

➺ Not "over-architecting” for


➺ a particular microarchitecture style (e.g., microcoded, in-
order, decoupled, out-of-order) or
➺ implementation technology (e.g., full-custom, ASIC,
FPGA), but which allows efficient implementation in any
of these
➺ RISC-V ISA includes
➺ A small base integer ISA, usable by itself as a base for
customized accelerators or for educational purposes, and
➺ Optional standard extensions, to support general-
purpose software development
➺ Optional customer extensions
A.9 RISC-V Architecture 43
RISC-V ISAs

➺ Base ISAs
➺ RV32I: base 32-bit integer instructions (32b registers)
➺ RV32E: base 32-bit integer instructions (embedded variant, 16 registers)
➺ RV64I: base 64-bit integer instructions (64b registers)

➺ Extensions
➺ M: Integer Multiplication and Division
➺ A: Atomic Instructions
➺ F: Single-Precision Floating-Point
➺ D: Double-Precision Floating-Point
➺ C: Compressed Instructions (16 bit)

A.9 RISC-V Architecture 44


Register Operands
➺ Refer to “source” and “destination” registers
➺ “Register Transfer” model
➺ Data starts in registers
➺ Transformed by combinational logic
➺ Transferred back into registers
➺ Instructions are usually in the form:
➺ <mnemonic> <destination>, <source1>, <source2>
➺ <mnemonic> <source1>, <source2>
➺ <mnemonic> <destination>
➺ <mnemonic> <source1>
➺ Design Principle: Smaller is faster

45
RV32 Registers

➺ 32 32-bit registers: X0 – X31


➺ X0: constant 0
➺ X1: return address on a call

➺ 32 FP registers: f0 – f31
➺ optional

➺ FP status register (fsr), used for FP rounding


mode & exception reporting

A.9 RISC-V Architecture 46


RV32 Registers (Simplified)

➺ Defined convention for which registers are used


for what kinds of values
➺ “t#” → Temporary registers
➺ “s#” → Saved registers
➺ “a#” → Arguments
➺ Helps to ensure consistency
& supports collaboration
➺ Assume temporary registers can be
overwritten by any function
➺ Assume saved registers will still be
there after a function call
➺ Always pass function inputs in
argument registers

47
RV Data Types

➺ 8-bit bytes
➺ 16-bit half-word: common in C
➺ 32-bit words, integer, single-precision FP
➺ 64-bit double-words, integer double-precision FP

A.9 RISC-V Architecture 48


RV Addressing Modes

➺ Immediate: 12 bits
➺ Displacement: 12 bits
➺ Register indirect
➺ simulated by placing 0 for displacement

➺ Unaligned memory access is allowed


➺ but performance is low

A.9 RISC-V Architecture 49


RV Instruction Formats

A.9 RISC-V Architecture 50


RV Encoding

A.9 RISC-V Architecture


RV Load/Store Instructions

A.9 RISC-V Architecture


RV ALU Instructions

A.9 RISC-V Architecture


RV Control Flow Instructions

See App A and the RISC-V ISA Manual for more information

A.9 RISC-V Architecture


Backup

55
Types of Instructions

Operations supported by most ISAs


A.4 Types
A.5 Operations
of Instructions
in the Instructions 56
Instruction Distribution

Simple instructions dominate!


A.5 Operations in the Instructions 57
Control Flow Instructions

Conditional branches dominate!

A.6 Control Flow Instructions


SPEC CPU 2000 on Alpha
58
Conditional Branch Distances

4-8 bits can encode 90% branches!


SPEC CPU 2000 on Alpha

A.6 Control Flow Instructions 59


Branch Condition Evaluation

A.6 Control Flow Instructions 60


Types of Comparisons

SPEC CPU 2000 on Alpha


A.6 Control Flow Instructions 61
Computer Architecture
A Quantitative Approach, Fifth Edition

Pipelining – Basic Review

1
Objective and Reading

➜ Objective
➜ Review of basic pipelining architecture – how it improves performance, and its issues
➜ Introduce ILP

➜ Reading
➜ Appendix C.1
➜ Chapter 3.1

2
A Simplified View of Computers

CPU

I$ D$

Memory

Interconnect

HD DISP ... KB
3
C.1 Introduction
C.1 Introduction

4
Introduction
➜ Design Principle – exploit parallelism
➜ Pipelining became a universal technique in 1985
➜ Overlaps execution of instructions
➜ Exploits Instruction Level Parallelism (ILP)
➜ Two main approaches to detect ILP
➜ Hardware-based dynamic approaches
➜ Used in server and desktop processors
➜ Not used as extensively in PMD processors
➜ Compiler-based static approaches
➜ Not as successful outside of scientific applications
(Figure: Fetch → Decode → Execute → Writeback)

C.1 Introduction 5
Instruction Execution of RISC
➜ Initial State: PC is set to point to the first instruction

➜ For each instruction, perform the following 5 steps:


➜ Instruction Fetch (IF)
➜ Instruction Decode/Register Read (ID)
➜ Execution/Effective Address Calculation (EX)
➜ Memory Access (MEM)
➜ Write Back (WB)
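As a rough illustration only (not from the text), the five steps can be sketched in C for a toy two-instruction machine; the encoding, opcodes, and memory size here are all invented for the example:

#include <stdint.h>

static uint32_t mem[1024];    /* unified instruction/data memory (indexed by word) */
static uint32_t regs[32];     /* general-purpose registers                          */
static uint32_t pc = 0;       /* program counter (byte address)                     */

void step(void) {
    /* IF: fetch the instruction and advance the PC */
    uint32_t ir = mem[pc / 4];
    pc += 4;

    /* ID: decode fields and read source registers (toy encoding) */
    uint32_t op  = ir >> 24;              /* 0 = add, 1 = load */
    uint32_t rd  = (ir >> 16) & 0x1F;
    uint32_t a   = regs[(ir >> 8) & 0x1F];
    uint32_t b   = regs[ir & 0x1F];
    uint32_t imm = ir & 0xFF;

    /* EX: ALU operation or effective-address calculation */
    uint32_t alu = (op == 0) ? a + b : a + imm;

    /* MEM: only the load touches memory in this toy machine */
    uint32_t result = (op == 1) ? mem[alu / 4] : alu;

    /* WB: write the result back to the destination register */
    regs[rd] = result;
}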

C.1 Introduction 6
Instruction Execution
➜ Instruction Fetch:
➜ Send PC to memory, assert MemRead signal
➜ Instruction fetched from memory
➜ Place instruction in IR: IR ← Mem[PC]
➜ Update PC to next instruction: PC ← [PC] + 4

C.1 Introduction 7
Instruction Execution
➜ Instruction Decode:
➜ Instruction in IR decoded by control logic, instruction type and
operands determined
➜ Source operands read from general purpose register file, etc

C.1 Introduction 8
Instruction Execution
➜ Execute:
➜ ALU operates on operands prepared in previous cycle
➜ One of four functions depending upon opcode
➜ Memory Reference
➜ Form effective address from base register and immediate offset
➜ ALU Output ← [A] + Imm
➜ Register-Register ALU Instruction
➜ ALU Output ← [A] op [B]
➜ Register-Immediate ALU Instruction
➜ ALU Output ← [A] op Imm
➜ Branch
➜ Compute branch target by adding Imm to PC
➜ ALU Output ← [PC] + (Imm << 2)
➜ Evaluate the branch condition

C.1 Introduction 9
Instruction Execution
➜ Memory Access:
➜ For load instructions, read data from memory
➜ For store instructions, write data to memory

C.1 Introduction 10
Instruction Execution
➜ Write-back:
➜ Results written to destination register
➜ Results from mem read or ALU

C.1 Introduction 11
Instruction Execution – Example
add X3, X4, X5 ; X3 ← [X4] + [X5]
➜ Source registers: X4, X5 Destination register: X3
➜ Instruction steps:
➜ Fetch: Fetch the instruction into IR and increment the program
counter
➜ Decode: Decode the instruction in IR to determine the
operation to be performed (add). Read the contents of
registers X4 and X5
➜ Execute: Compute the sum [X4] + [X5]
➜ Memory Access: No action, since there are no memory
operands
➜ Write-back: Write the result into register X3

C.1 Introduction 12
Instruction Execution – Example
ld X5, N(X7) ; X5 ← [[X7] + N]
➜ Source register: X7 Destination register: X5
➜ Immediate value N is given in the instruction word
➜ Instruction steps:
Fetch: Fetch the instruction and increment the program counter
Decode: Decode the instruction in IR to determine the operation
to be performed (load). Read the contents of register X7
Execute: Add the immediate value N to the contents of X7
Memory Access: Use the sum N+[X7] as the effective address of
the source operand, read the contents of that location from
memory
Write-back: Write the data received from memory into register X5

C.1 Introduction 13
Instruction Execution – Example
sd X6, N(X8) ; Mem[N + [X8]] ← [X6]
➜ Source registers: X6, X8 Destination register: None
➜ The immediate value N is given in the instruction word
➜ Instruction steps:
➜ Fetch: Fetch the instruction and increment the program
counter
➜ Decode: Decode the instruction in IR to determine the
operation to be performed (store). Read the contents of
registers X6 and X8.
➜ Execute: Compute the effective address N + [X8]
➜ Memory Access: Store the contents of register X6 into memory
location N + [X8]
➜ Writeback: No action
C.1 Introduction 14
Basic Pipeline

To improve performance, we can make circuit faster,


or use …

C.1 Introduction 15
• Pipeline clock cycle determined by the slowest stage.
• Pipeline registers add extra overhead.

C.1 Introduction 16
Ideal Pipeline Performance
➜ Balanced pipeline (each stage has the same delay)
➜ Ignore overhead due to clock skew and pipeline registers
➜ Ignore pipeline fill and drain overheads

➜ Ideal speedup = Number of pipeline stages

C.1 Introduction 17
Pipeline Performance
➜ Example: A program consisting of 500 instructions is executed on
a 5-stage processor. How many cycles would be required to
complete the program. Assume ideal overlap in case of pipelining.
➜ Without pipelining:
➜ Each instruction will require 5 cycles. There will be no overlap
amongst successive instructions.
➜ Number of cycles = 500 * 5 = 2500
➜ With pipelining:
➜ Each pipeline stage will process a different instruction every
cycle. First instruction will complete in 5 cycles, then one
instruction will complete in every cycle, due to ideal overlap.
➜ Number of cycles = 1*5 + (499*1) = 504
➜ Speedup with pipelining = 2500/504 = 4.96
C.1 Introduction 18
Pipeline Performance
➜ Problem: Consider a non-pipelined processor using the 5-
stage datapath with 1 ns clock cycle. Assume that due to
clock skew and pipeline registers, pipelining the processor
adds 0.2 ns of overhead to the clock cycle. How much
speedup can we expect to gain from pipelining? Assume a
balanced pipeline and ignore the pipeline fill and drain
overheads. (A similar ex. in the book)
➜ Solution:
➜ Without pipelining: Clock period = 1 ns, CPI = 5
➜ With pipelining: Clock period = 1 + 0.2 = 1.2 ns, CPI = 1
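So the expected speedup is (5 × 1 ns) / (1 × 1.2 ns) ≈ 4.17, somewhat below the ideal factor of 5.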

C.1 Introduction 19
Pipeline Performance
➜ The potential increase in performance resulting from
pipelining is proportional to the number of pipeline
stages
➜ However, this increase would be achieved only if
➜ all pipeline stages require the same time to complete, and
➜ there is no interruption throughout program execution

➜ Unfortunately, this is not true


➜ there are times when an instruction cannot proceed from one
stage to the next in every clock cycle

C.1 Introduction 20
Pipeline Performance – cont’d

➜ Pipeline stages need to be balanced


➜ the clock is fixed by the slowest stage.

➜ Ideally, every stage has same latency


➜ In reality, stages are unbalanced
➜ memory accesses are much slower

➜ Example: MEM has 10 ns latency, other stages have 2 ns


latency
➜ clock cycle time = 10 ns
➜ Reducing memory latency is critical  memory hierarchy

➜ Other factor: pipeline stalls

21
Pipeline Stalls

Ij

Ij+1

Ij+2

➜ Assume that Instruction Ij+1 is stalled in the decode stage for two extra
cycles
➜ This will cause Ij+2 to be stalled in the fetch stage, until Ij+1 proceeds
➜ New instructions cannot enter the pipeline until Ij+2 proceeds past the
fetch stage after cycle 5 => execution time increases by two cycles

C.1 Introduction 22
Summary
➜ Instruction executes in a sequence of stages
➜ Multiple instructions can execute at the same time in
different stages
➜ Instruction-level parallelism (ILP)

➜ Pipelining architecture exploits ILP


➜ Ideal pipelining performance improvement == # of
pipeline stages
➜ In reality, it suffers due to hazards
➜ More discussions on it are scheduled

23
Memory Hierarchy Design

1
Recall Pipeline Performance
➺ Pipeline stages need to be balanced
➺ the clock is fixed by the slowest stage.
➺ Ideal case, every stage has same latency
➺ In reality, stages are unbalanced
➺ memory accesses are much slower
➺ Example: MEM 10 ns latency, other stages 2 ns
latency
➺ clock cycle time = 10 ns
➺ How do we reduce MEM stage latency?
➺ Memory hierarchy
A Simplified View of Computers

CPU

I$ D$

Memory
Main Memory

Interconnect

HD DISP ... KB
Objectives and Reading
➺ Objectives
➺ Understand memory hierarchy organizations and their
impacts on performance
➺ Evaluate performance tradeoffs of different memory
hierarchy organizations
➺ Reading
➺ Computer Architecture: A Quantitative Approach
➤ Appendix B, Chapter 2
➺ Computer Organization and Design: The
Hardware/Software Interface
➤ Chapter 5

4
Memory Technology – Overview
➺ Static RAM (SRAM)
➺ 0.5ns – 2ns, $2000 – $5000 per GB
➺ Dynamic RAM (DRAM)
➺ 20ns – 30ns, $10 – $50 per GB
➺ Flash – non-volatile
➺ 20 – 100 us, 5-10x cheaper than DRAM
➺ Magnetic disk
➺ 5ms – 20ms, $0.20 – $2 per GB

➺ Ideal memory
➺ Access time of SRAM,
➺ Capacity and cost/GB of disk
5
The “Memory Wall” Problem

Processor mem accesses/sec vs DRAM accesses/sec


6
Computer Energy Usage

7
The “Memory Wall” – A Multi-Core Case
➺ Aggregate peak bandwidth grows with # cores:
➺ Intel Core i7 can generate two references per core per clock
➺ Four cores and 3.2 GHz clock
25.6 billion 64-bit data references/second +
12.8 billion 128-bit instruction references/second
= 409.6 GB/sec
➺ DRAM bandwidth is only 6% of this (25 GB/s)
➺ How does memory meet processor bandwidth demand?
• Multi-port, pipelined caches
• Two levels of cache per core
• Shared third-level cache on chip

2.1 Introduction 8
Principle of Locality – Review
➺ Temporal locality
➺ Programs often access a small proportion of their
address space at any time
➺ Items accessed recently are likely to be accessed again
soon
➺ e.g., instructions and variables in a loop

➺ Spatial locality
➺ Items near those accessed recently are likely to be
accessed soon
➺ E.g., sequential instruction access, array data

B.1 Introduction 9
Principle of Locality – Review
➺ Identify Temporal and spatial locality

int sum = 0;
int x[1000];

for (int c = 0; c < 1000; c++) {


sum += x[c];
x[c] = 0;
}
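In this loop, sum and the counter c show temporal locality (reused every iteration), while the sequential accesses to x[c] show spatial locality; x[c] is also read and then written within the same iteration, which is temporal reuse as well.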

B.1 Introduction 10
Memory Hierarchy – Basic Idea
➺ Ideally memory = unlimited capacity with low latency
➺ Fast memory technology is more expensive per bit than
slower memory
➺ Solution: organize memory system into a hierarchy
➺ Entire addressable memory space available in largest, slowest
memory
➺ Incrementally smaller and faster memories, each containing a
subset of the memory below it, proceed in steps up toward the
processor
➺ Temporal and spatial locality ensures that nearly all
references can be found in smaller memories
➺ Gives the illusion of a large, fast memory being presented to
the processor

2.1 Introduction 11
Memory Hierarchies

Mobile devices

Desktop

Servers

2.1 Introduction 12
Energy Consumptions

Song Han, FPGA’17 talk, “Deep Learning – Tutorial and Recent Trends”

More cache accesses lead to better efficiency


2.1 Introduction
Basic Cache Organizations
Memory Hierarchy Questions
➺ Data transferred between cache & memory are in
blocks
➺ Block placement - where can a block be placed in
the upper level?
➺ Block identification – how to find a block in the
upper level?
➺ Block replacement - which block should be
replaced on a miss?
➺ Write strategy – how to handle writes?
B.1 Introduction 2
Basics
➺ When a word is found in cache --> cache hit.
➺ When a word is not found in the cache, a miss occurs:
➺ The word must be fetched from the lower level in the hierarchy, incurring its (much longer)
latency
➺ Lower level may be another cache or the main memory
➺ Also fetch the other words contained within the block
➤ Takes advantage of spatial locality
➺ Place block into cache in any location within its set, determined
by address

word block
Main
CPU Cache
Memory

B.1 Introduction 3
Direct Mapped Cache
➺ Only one choice
➺ cache index = (Block address) MOD (#Blocks in cache)

◼ #Blocks is a power of 2
◼ Use low-order address bits to access bytes in a block

B.1 Introduction 5
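As an illustration of the MOD mapping above, here is a small hypothetical C helper (the block and cache sizes are example values, chosen to match the 64-block, 16-byte-block example that appears later in these slides) that splits a byte address into tag, index, and offset:

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 16u   /* bytes per block (example value)     */
#define NUM_BLOCKS  64u   /* blocks in the cache (example value) */

int main(void) {
    uint32_t addr = 1200;                           /* byte address            */
    uint32_t block_addr = addr / BLOCK_BYTES;       /* 1200 / 16  = 75         */
    uint32_t offset     = addr % BLOCK_BYTES;       /* 1200 mod 16 = 0         */
    uint32_t index      = block_addr % NUM_BLOCKS;  /* 75 mod 64  = 11         */
    uint32_t tag        = block_addr / NUM_BLOCKS;  /* remaining high bits = 1 */
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);
    return 0;
}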
Tags and Valid Bits
➺ One cache line ← multiple memory blocks (many memory blocks map to the same cache line)
➺ Cache line is aka cache block

➺ How do we know which particular memory block


is stored in a cache location?
➺ For a write, store the tag bits as well as the data
➺ For a read, match the tag in the address against the tag stored in the cache

➺ What if there is no data in a location?


➺ Valid bit: 1 = present, 0 = not present
➺ Initially 0

B.1 Introduction 6
Address Subdivision
◼ Example: 1024-word cache (32-bit words → 4 kB cache)
◼ Cache has 2^10 words, so 10 bits are used for the index

◼ 32 – 10 – 2 = 20 bits for the tag [2 bits for byte offset]

◼ If there is data present at the address specified (V = 1) and the


tags match, this is a cache hit
◼ If there is data present at the address specified (V = 1) but the
tags do NOT match, this is a cache miss
◼ If there is no data (V = 0) this is also a miss

7
Address Subdivision

8
Another Cache Example
◼ Direct mapped cache (assume 32-bit address space)
◼ 2048 blocks, each holds 32 bytes of data
◼ How many bytes can we store in the cache?
◼ 2048 x 32 bytes = 65536 bytes = 64K cache
◼ What are the sizes of the index, tag, byte fields?
◼ 2048 blocks = 2^11 blocks
◼ 32 bytes = 2^5 bytes → assuming byte-level addressing
◼ 32 – 11 – 5 = 16 bits for tag
◼ How many bits per cache line do we actually store?
◼ 16 bits + (32 * 8) bits + 2 = 274 bits / line [offset not stored]
◼ What is the actual total size of the cache?
◼ 274 x 2048 = 68.5 KiB, or ~7% overhead 14
Block vs Byte Offset
◼ Direct mapped cache (assume 32-bit address space), but now
consider word alignment
◼ 2048 blocks, each holds 32 bytes of data (as before)
◼ 32 bytes @ 4 bytes / word = 8 words

◼ Address division:
◼ 2^11 blocks → 11-bit index
◼ 2^3 words / block = 3-bit block offset [allows us to select a word]
◼ 2^2 bytes / word → ignore 2 LSBs of address [used after loading the desired word from the block]
◼ 32 – 11 – 3 – 2 = 16-bit tag

◼ Same result as before, just clarifying usage of offset field

15
Cache Sizing Example
◼ How many total bits are required for a direct-mapped cache
with 16 KiB data and 4-word blocks, assuming 32-bit
addresses? What is the storage overhead?
◼ How many words? 16 KiB / 4 bytes per word → 4096 (2^12) words
◼ How many blocks? 4096 / 4 words/block = 1024 (2^10) blocks
◼ How many bits in the offset? 4-word blocks → 2 bits (2^2 words/block)
◼ How many bits in the tag? 32 – 10 (2^10 blocks) – 2 (2^2 words/block) – 2
(2^2 bytes/word) = 18-bit tag
◼ How many actual bits per block? (4 words) * (32 bits/word) + 18 (bits of
tag) + 2 (dirty + valid) = 128 + 18 + 2 = 148 bits / block
◼ How large is the memory? 148 bits / block * 1024 blocks =
151,552 bits = 18,944 bytes = 18.5 KiB
◼ What is the storage overhead? 18.5 KiB to store 16 KiB = 15.6%
30
Example: Larger Block Size
➺ Cache: 64 blocks, 16 bytes/block
➺ To what cache block number does address 1200
(decimal) map?
➺ Assume 32-bit address
➺ Byte-addressable

V tag block 0
V tag block 1
address …
32

V tag block 63

B.1 Introduction 31
Example: Larger Block Size
➺ Cache: 64 blocks, 16 bytes/block
➺ To what block number does address 1200 map?
➺ Block address = 1200/16 = 75
➺ Block offset value = 1200 modulo 16 = 0
➺ Cache block index = 75 modulo 64 = 11

31 10 9 4 3 0
Tag Index Offset
22 bits 6 bits 4 bits

B.1 Introduction 32
Cache Example
➺ 8-blocks, 1 word/block, direct mapped
➺ Initial state

Index V Tag Data


000 0
001 0
010 0
011 0
100 0
101 0
110 0
111 0

Gray area is what’s actually included in cache


B.1 Introduction 33
Cache Example
Word addr Binary addr Hit/miss Cache block
1 22 10 110 Miss 110

Index V Tag Data


000 0
001 0
010 0
011 0
100 0
101 0
110 1 10 Mem[10110] ← compulsory miss
111 0

B.1 Introduction 34
Cache Example
Word addr Binary addr Hit/miss Cache block
2 26 11 010 Miss 010

Index V Tag Data


000 0
001 0
010 1 11 Mem[11010]
011 0
100 0
101 0
110 1 10 Mem[10110]
111 0

B.1 Introduction 35
Cache Example
Word addr Binary addr Hit/miss Cache block
3 16 10 000 Miss 000
4 3 00 011 Miss 011
5 16 10 000 Hit 000

Index V Tag Data


000 1 10 Mem[10000]
001 0
010 1 11 Mem[11010]
011 1 00 Mem[00011]
100 0
101 0
110 1 10 Mem[10110]
111 0

B.1 Introduction 36
Cache Example
Word addr Binary addr Hit/miss Cache block
18 10 010 Miss 010

Index V Tag Data


000 1 10 Mem[10000]
001 0
010 1 10 Mem[10010]
011 1 00 Mem[00011]
100 0
101 0
110 1 10 Mem[10110]
111 0

B.1 Introduction 37
Cache Misses

➺ On cache hit, CPU proceeds normally


➺ On cache miss
➺ Stall the CPU
➺ Fetch a block from next level of mem. hierarchy
➺ Instruction cache miss
➤ Restart instruction fetch
➺ Data cache miss
➤ Complete data access

B.1 Introduction 38
Cache Misses
➺ Miss rate
➺ Fraction of cache access that result in a miss

➺ Miss Penalty – time to access lower level memory

➺ Causes of misses
➺ Compulsory
➤ First reference to a block
➺ Capacity
➤ Blocks discarded and later retrieved due to cache capacity limit
➺ Conflict
➤ Program makes repeated references to multiple addresses from different
blocks that map to the same location in the cache
➤ Only happen in direct-mapped or set associative caches

2.1 Introduction 39
Measuring Cache Performance
➺ CPU time = (CPU cycles + mem stall cycles) * cycle time
➺ CPU cycles = IC*Ideal CPI (CPU cycles under cache hits)
➺ Memory stall cycles = additional cycles for handling cache
misses
latency of accessing
lower-level memory

Stall cycles per instruction


B.1 Introduction 40
Cache Performance – Example
➺ Given
➺ I-cache miss rate = 2%
➺ D-cache miss rate = 4%
➺ Miss penalty = 100 cycles
➺ Base CPI (ideal cache) = 2 cycles (aka ideal CPI)
➺ Load & stores are 36% of instructions
➺ Stall cycles per instruction
➺ I-cache: 1 × 0.02 × 100 = 2
➺ D-cache: 0.36 × 0.04 × 100 = 1.44
➺ Actual CPI = 2 + 2 + 1.44 = 5.44
➺ A CPU with an ideal cache would be 5.44/2 = 2.72 times faster
B.1 Introduction 41
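The same arithmetic in code: an illustrative C sketch (not part of the slides) that plugs in the miss rates, miss penalty, and load/store fraction above to get the stall cycles per instruction and the actual CPI.

#include <stdio.h>

int main(void) {
    const double base_cpi     = 2.0;
    const double icache_mr    = 0.02;
    const double dcache_mr    = 0.04;
    const double ld_st_frac   = 0.36;   /* data accesses per instruction */
    const double miss_penalty = 100.0;

    double i_stalls = 1.0 * icache_mr * miss_penalty;          /* 2.00   */
    double d_stalls = ld_st_frac * dcache_mr * miss_penalty;   /* 1.44   */
    double cpi      = base_cpi + i_stalls + d_stalls;          /* 5.44   */

    printf("I-stalls = %.2f, D-stalls = %.2f, actual CPI = %.2f\n",
           i_stalls, d_stalls, cpi);
    printf("CPU with an ideal cache would be %.2fx faster\n", cpi / base_cpi);
    return 0;
}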
Cache Performance - Average Access Time

➺ Hit time is also important for performance


➺ Average memory access time (AMAT)

AMAT = Hit time + Miss rate × Miss penalty

➺ Example
➺ CPU with 1ns clock cycle time, hit time = 2 cycles, miss
penalty = 20 cycles, I-cache miss rate = 5%
➺ AMAT = 2 + 0.05 × 20 = 3 cycles

B.2 Cache Performance 42


Performance Summary
➺ As CPU performance increased
➺ Miss penalty becomes more significant
➺ Increasing clock rate, and decreasing base CPI
➺ Memory stalls lead to more CPU cycles
➺ Greater proportion of time spent on memory stalls
➺ Cannot neglect cache behavior when evaluating
system performance
➺ What does Amdahl’s law tell us?

How to reduce memory stalls?

B.2 Cache Performance 43


Associative Caches – Reduce Conflict Misses
➺ Fully associative
➺ Allow a given block to go in any cache entry
➺ Requires all entries to be searched at once
➺ Comparator per entry – expensive (area)
➺ n-way set associative
➺ Each set contains n entries
➺ Block address determines the set
➤ Cache index = (Block address) MOD (#Sets in cache)
➺ Search all entries in a given set at once for correct one
➺ n comparators (less expensive than fully associative)
➺ Direct-mapped = 1-way associative

B.1 Introduction 44
Associative Cache Example
Fully Associative Direct Mapped 2-way Associative

B.1 Introduction
4-way Associative Cache Organization

Associativity Example
➺ Compare 4-block caches, block size = 1 byte
➺ Direct mapped, 2-way set associative, fully associative
➺ Sequence of addresses: 0, 8, 0, 6, 8

➺ Direct mapped

          Address 0   Address 8   Address 0   Address 6   Address 8

Index 0:  Mem[0]      Mem[8]      Mem[0]      Mem[0]      Mem[8]
Index 1:
Index 2:                                      Mem[6]      Mem[6]
Index 3:

5 misses (their types?)


B.1 Introduction 47
Associativity Example

➺ 2-way set associative

          Address 0   Address 8   Address 0   Address 6   Address 8

Set 0:    Mem[0]      Mem[0]      Mem[0]      Mem[0]      Mem[8]
          (empty)     Mem[8]      Mem[8]      Mem[6]      Mem[6]
Set 1:    (unused: addresses 0, 6, and 8 all map to set 0)

B.1 Introduction 48
Associativity Example

➺ Fully associative

Address 0 Address 8 Address 0 Address 6 Address 8


Mem[0] Mem[0] Mem[0] Mem[0] Mem[0]
Mem[8] Mem[8] Mem[8] Mem[8]
Mem[6] Mem[6]

B.1 Introduction 49
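The three examples above can be reproduced with a small simulator. The following C sketch (illustrative, not from the slides) replays the address sequence 0, 8, 0, 6, 8 on a 4-block cache with 1-byte blocks and LRU replacement; associativity 1 gives the direct-mapped result (5 misses), 2 the 2-way result, and 4 the fully associative result.

#include <stdio.h>

#define NUM_BLOCKS 4

static int count_misses(const unsigned *addrs, int n, int assoc) {
    int sets = NUM_BLOCKS / assoc;
    unsigned tag[NUM_BLOCKS] = {0};
    int valid[NUM_BLOCKS] = {0};
    int lru[NUM_BLOCKS] = {0};          /* larger = used more recently   */
    int time = 0, misses = 0;

    for (int i = 0; i < n; i++) {
        unsigned block = addrs[i];      /* 1-byte blocks: block number = address */
        int set = block % sets;
        int hit_way = -1, victim = 0;

        for (int w = 0; w < assoc; w++) {
            int slot = set * assoc + w;
            if (valid[slot] && tag[slot] == block)
                hit_way = w;            /* tag match on a valid entry    */
            if (lru[slot] < lru[set * assoc + victim])
                victim = w;             /* least recently used (or empty) way */
        }
        if (hit_way < 0) {              /* miss: fill the chosen victim way */
            misses++;
            hit_way = victim;
            tag[set * assoc + hit_way] = block;
            valid[set * assoc + hit_way] = 1;
        }
        lru[set * assoc + hit_way] = ++time;
    }
    return misses;
}

int main(void) {
    unsigned seq[] = {0, 8, 0, 6, 8};
    printf("direct mapped : %d misses\n", count_misses(seq, 5, 1));
    printf("2-way LRU     : %d misses\n", count_misses(seq, 5, 2));
    printf("fully assoc.  : %d misses\n", count_misses(seq, 5, 4));
    return 0;
}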
Spectrum of Associativity
➺ For a cache with
8 entries

B.1 Introduction 50
Size of Tags vs Associativity
➺ No. address bits = 32
➺ Block size = 4 bytes
➺ Cache size = 16KB → 2^12 blocks
➺ Direct mapping
   31 ........ 14 | 13 ......... 2 | 1 0
   tag (18 bits)  | index (12 bits) | offset
➺ Tag bits = 18 x 2^12
➺ Comparators = 1
➺ 4-way set-associative
   31 ........ 12 | 11 ......... 2 | 1 0
   tag (20 bits)  | index (10 bits) | offset
➺ Tag bits = 4 x 2^10 x 20 = 20 x 2^12
➺ Comparators = 4
➺ Fully associative
   31 ......... 2 | 1 0
   tag (30 bits)  | offset
➺ Tag bits = 30 x 2^12
➺ Comparators = 2^12

B.1 Introduction 51
Size of Tags vs Associativity
➺ Increasing associativity requires
➺ More tag bits per cache block
➺ More comparators, each of which is more complex
➤ higher hit time, larger circuits
➺ The choice among direct, set-associative and
fully-associative mapping in any memory
hierarchy will depend on
➺ Miss rate vs cost of implementing associativity, both in
time and in extra hardware overhead

AMAT = Hit time↑ + (Miss rate↓) × Miss penalty


B.1 Introduction 52
Replacement Policy
➺ Direct mapped: no choice
➺ Set associative
➺ Prefer non-valid entry, if there is one
➺ Otherwise, choose among entries in the set
➺ Least-recently used (LRU)
➺ Choose the one unused for the longest time
➤ Simple for 2-way, manageable for 4-way, too hard otherwise
➤ FIFO approximates LRU – replace the oldest block

➺ Random
➺ Gives approximately the same performance as LRU for
high associativity
B.1 Introduction 53
Write Policy – Write-Through
➺ Update cache and memory together
➺ Cache and memory data remain the same
➺ Easier to implement
➺ Writes take longer – wait for mem update to complete
➺ e.g., if base CPI = 1, 10% of instructions are stores, memory write latency is 100 cycles
➤ Effective CPI = 1 + 0.1×100 = 11
(Figure: CPU → Cache → Memory)

B.1 Introduction 54
Write-Through with Write Buffer

➺ Write buffer – holds data waiting to be written to memory
➺ CPU writes to cache & write buffer
➺ Then, CPU continues immediately
➺ Only stalls on write if write buffer is full
➺ Evict write buffer to memory
➺ Write buffer is freed when write-to-memory is finished
(Figure: CPU → Cache → WBuf → Memory)

B.1 Introduction 55
Write Policy – Write-Back
➺ Just update the block in cache
➺ Keep track of whether each block is dirty – dirty bits
➺ Inconsistent data in $ and Mem
➺ When a dirty block is replaced, write it back to memory
➺ Write speed is faster
➺ One low-level memory update for multiple cache writes
➺ Energy saving
➺ Write buffer can also be used
(Figure: CPU → Cache → Memory)
B.1 Introduction 56
Write Allocation
➺ What should happen on a write miss?
➺ Write allocate
➺ No-write allocate

➺ Alternatives for write-through


➺ Write-allocate: fetch the block, then write-through
➺ No-write allocate: update memory w/o fetching the
block
➺ For write-back
➺ Usually fetch the block – write-allocate
➺ Act like read misses

B.1 Introduction 57
Write Miss Policies – Example

write M[100]
write M[100]
read M[200]
write M[200]
write M[100]

➺ Assume fully associative cache


➺ Find number of misses for the following write policies
➺ Write-through cache with No-write allocate
➺ Write-back cache with Write allocate

B.1 Introduction 58
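For readers who want to check their answer, here is an illustrative C sketch (not the slides' solution) that replays the reference sequence above under both policies. It assumes a cache large enough to hold both blocks, so only the allocation decision on a write miss matters; in this small model the write-back vs. write-through choice itself does not change the miss count.

#include <stdio.h>

#define MAX_BLOCKS 8

typedef struct { unsigned addr[MAX_BLOCKS]; int n; } Cache;

static int lookup(const Cache *c, unsigned a) {
    for (int i = 0; i < c->n; i++)
        if (c->addr[i] == a) return 1;
    return 0;
}

static int count_misses(int write_allocate) {
    /* reference sequence from the slide: (is_write, address) */
    struct { int is_write; unsigned a; } ref[] =
        { {1, 100}, {1, 100}, {0, 200}, {1, 200}, {1, 100} };
    Cache c = { {0}, 0 };
    int misses = 0;

    for (int i = 0; i < 5; i++) {
        if (lookup(&c, ref[i].a))
            continue;                         /* hit                      */
        misses++;                             /* miss                     */
        if (!ref[i].is_write || write_allocate)
            c.addr[c.n++] = ref[i].a;         /* reads always allocate;   */
    }                                         /* writes only if allowed   */
    return misses;
}

int main(void) {
    printf("write-through + no-write allocate: %d misses\n", count_misses(0));
    printf("write-back    + write allocate   : %d misses\n", count_misses(1));
    return 0;
}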
Handling Writes

Write-Back + Allocate
• Allocate on write miss
• Flush dirty blocks only when replaced
• A write hit only accesses the cache

Write-Back + Non-Allocate
• Write to mem on miss
• No cache block fetch

Write-Through + Allocate
• Allocate on write miss
• Write to cache & mem in parallel

Write-Through + Non-Allocate
• Write to mem on miss
• No cache block fetch

59
Cache Performance – Example

Base CPI = 1,
cycle time = 1ns
Hit time(DM) = 1 cycle
Miss rate (DM) = 0.021
Hit time (2-way) = 1.35 cycles
Miss rate (2-way) = 0.019
Miss penalty = 150 cycles
Avg mem. req/inst = 1.4

Which cache is faster using


AMAT?

B.1 Introduction 60
Cache Performance – Example

Base CPI = 1,
cycle time = 1ns
Hit time(DM) = 1 cycle
Miss rate (DM) = 0.021
Hit time (2-way) = 1.35 cycles
Miss rate (2-way) = 0.019
Miss penalty = 150 cycles
Avg mem. req/inst = 1.4

Which cache is faster using CPU time?

➺ CPU time = IC × CPI(effective) × cycle time
➺ CPI(effective) = Base CPI + memory stall cycles per instruction
➺ Memory stalls per instruction = (Avg mem. req/inst) × (AMAT – hit time)

➺ Direct mapped:
➺ AMAT = 1 + 0.021 × 150 = 4.15 cycles
➺ Memory stalls/inst = 1.4 × (4.15 – 1) = 4.41 cycles
➺ CPI(effective) = 1 + 4.41 = 5.41
➺ 2-way: repeat the same steps, but remember its hit time is 1.35 cycles; decide how the extra 0.35 cycles per access should show up in the CPU time (see the sketch after the exercise on the next slide)

B.1 Introduction 61
Cache Performance – Exercise

Base CPI = 1,
cycle time = 1ns
Hit time(DM) = 1 cycle
Miss rate (DM) = 0.021
Hit time (2-way) = 1.35 cycles
Miss rate (2-way) = 0.019
Miss penalty = 200 cycles
Avg mem. req/inst = 1.4

Which cache is faster?

B.1 Introduction 62
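A hedged C sketch covering both the example (miss penalty = 150) and the exercise (miss penalty = 200). It assumes the base CPI already covers a 1-cycle hit, so any hit time beyond 1 cycle is charged as extra stall cycles per memory access; another common convention is to stretch the clock cycle instead, which changes the numbers but not the method.

#include <stdio.h>

static void evaluate(const char *name, double hit_time, double miss_rate,
                     double miss_penalty) {
    const double base_cpi   = 1.0;   /* covers a 1-cycle hit             */
    const double cycle_ns   = 1.0;
    const double refs_per_i = 1.4;   /* avg memory requests/instruction  */

    double amat   = hit_time + miss_rate * miss_penalty;        /* cycles */
    /* assumption: hit time beyond 1 cycle is charged as stall cycles    */
    double stalls = refs_per_i * ((hit_time - 1.0) + miss_rate * miss_penalty);
    double cpi    = base_cpi + stalls;

    printf("%-14s AMAT = %.2f cycles, CPI = %.2f, CPU time/inst = %.2f ns\n",
           name, amat, cpi, cpi * cycle_ns);
}

int main(void) {
    /* example: miss penalty = 150 cycles */
    evaluate("DM,    MP=150", 1.00, 0.021, 150);
    evaluate("2-way, MP=150", 1.35, 0.019, 150);
    /* exercise: miss penalty = 200 cycles */
    evaluate("DM,    MP=200", 1.00, 0.021, 200);
    evaluate("2-way, MP=200", 1.35, 0.019, 200);
    return 0;
}

Under this assumption the ranking of the two caches can differ between the 150-cycle and 200-cycle miss penalties, which is the point of comparing the example with the exercise.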
Basic Cache Optimizations
Cache Performance – Review

AMAT = Hit time + Miss rate × Miss penalty


(Miss rate × Miss penalty is the average miss time)
Any optimization should consider its impact on all three
factors
• Hit time – often determines clock cycle time
• Miss time – impact on pipeline stalls
• Miss rate
• Miss penalty

B.1 Introduction 2
Six Basic Cache Optimizations
➺ Larger block size
➺ Reduces compulsory misses
➺ Increases capacity and conflict misses, increases miss penalty
➺ Larger total cache capacity to reduce miss rate
➺ Increases hit time, increases power consumption
➺ Higher associativity
➺ Reduces conflict misses
➺ Increases hit time, increases power consumption
➺ Multi-level cache to reduce miss penalty
➺ Reduces overall memory access time
➺ Giving priority to read misses over writes
➺ Reduces miss penalty
➺ Avoiding address translation in cache indexing
➺ Reduces hit time

B.3 Six Basic Cache Optimizations 3


Optimization 1 – Larger Block Size
➺ Reduce compulsory misses due to spatial locality
➺ May increase conflict/capacity misses
➺ Increase miss penalty

CPU

Cache

Memory

B.3 Six Basic Cache Optimizations 4


Optimization 1 – Larger Block Size
➺ Reduce compulsory misses due to spatial locality
➺ May increase conflict/capacity misses
➺ Increase miss penalty

AMAT
B.3 Six Basic Cache Optimizations 5
Optimization 1 – Block Size Selection
➺ Determined by lower level memory

➺ High latency and high bandwidth – larger block size

➺ Low latency and low bandwidth – smaller block size

CPU
bandwidth = Bytes/s

Cache

bus bandwidth
Memory Mem bandwidth

B.3 Six Basic Cache Optimizations 6


Optimization 2 – Larger Cache
➺ Reduce capacity misses
➺ Increases hit time
➺ Increases power consumption

B.3 Six Basic Cache Optimizations 7


Optimization 3 – Higher Associativity
➺ Reduces conflict misses
➺ Increases hit time
➺ Increases power consumption

B.3 Six Basic Cache Optimizations AMAT 8


Optimization 3 – Higher Associativity
➺ Higher Associativity --> higher hit time
➺ Larger miss penalty rewards higher associativity
➺ Example:

Assume 4KB cache


AMAT(1-way) = 1 + 0.025 * miss penalty
AMAT(2-way) = 1.36 + 0.016 * miss penalty
Consider
1. miss penalty = 25 cycles
2. miss penalty = 100 cycles

B.3 Six Basic Cache Optimizations 9


Optimization 4 – Multilevel Cache
➺ L1 cache: small to keep hit time fast
➺ L2 cache: capture as many L1 misses & reduce miss penalty for L1$
➺ Local miss rates: miss rates of L1, L2
➺ Global miss rate = miss rate(L1) * miss rate(L2)
(Figure: CPU → L1$ (1-2 cycles) → L2$ (20-30 cycles) → Memory (~200 cycles))

B.3 Six Basic Cache Optimizations 10


Optimization 4 – Multilevel Cache
HT: hit time, MR: miss rate, MP: miss penalty, LMR: local miss rate, GMR: global miss rate

AMAT = HT(L1) + MR(L1) * MP(L1)
MP(L1) = HT(L2) + LMR(L2) * MP(L2)
Mem Stall cycles/inst = misses/inst(L1) * HT(L2) + misses/inst(L2) * MP(L2)
misses/inst(L1) = M Accesses/inst * MR(L1)
misses/inst(L2) = M Accesses/inst * GMR(L2)

(Figure: CPU → L1$ (1-2 cycles) → L2$ (20-30 cycles) → Memory (~200 cycles))


B.3 Six Basic Cache Optimizations 11
Optimization 4 – Multilevel Cache

Among 1000 mem accesses, L1 misses = 40, L2 misses = 20


Q1: Local/global miss rates?

HT(L1) = 1 cycle
HT(L2) = 10 cycles
MP(L2) = 200 cycles,
Avg M Access/inst = 1.5

Q2: AMAT?
Q3: Avg stall cycles/inst?

B.3 Six Basic Cache Optimizations 12
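The following C sketch (illustrative, not the slides' solution) applies the formulas from the previous slide to the numbers above to answer Q1–Q3.

#include <stdio.h>

int main(void) {
    const double accesses   = 1000.0;  /* memory accesses observed      */
    const double l1_misses  = 40.0;
    const double l2_misses  = 20.0;
    const double ht_l1      = 1.0;     /* cycles                        */
    const double ht_l2      = 10.0;
    const double mp_l2      = 200.0;
    const double refs_per_i = 1.5;     /* memory accesses / instruction */

    double mr_l1  = l1_misses / accesses;   /* local = global for L1    */
    double lmr_l2 = l2_misses / l1_misses;  /* local L2 miss rate       */
    double gmr_l2 = l2_misses / accesses;   /* global L2 miss rate      */

    double amat     = ht_l1 + mr_l1 * (ht_l2 + lmr_l2 * mp_l2);
    double stalls_i = refs_per_i * mr_l1 * ht_l2
                    + refs_per_i * gmr_l2 * mp_l2;

    printf("Q1: MR(L1) = %.3f, local MR(L2) = %.3f, global MR(L2) = %.3f\n",
           mr_l1, lmr_l2, gmr_l2);
    printf("Q2: AMAT = %.2f cycles\n", amat);
    printf("Q3: stall cycles/inst = %.2f\n", stalls_i);
    return 0;
}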


Optimization 4 – Multilevel Cache
➺ L1 hit time affects CPU speed
➺ Prefers small and fast L1

➺ Performance of L2 affects L1 miss penalty


➺ Prefers large L2 with higher associativity
➺ A little reduction in miss rate → big impacts on L2 speed due to
large L2 miss penalty

➺ See example in the book for the impact of higher


associativity on miss rates/AMAT

B.3 Six Basic Cache Optimizations 13


Optimization 4 – Multilevel Cache
➺ Impact of L2 cache size on execution speed

Base line: 8192KB L2


cache with 1 cycle hit
time

B.3 Six Basic Cache Optimizations 14


Optimization 5 – Giving Priority to Read
Misses over Writes
➺ Reduce miss penalty. See example below.

➺ Write-through cache with write buffers suffers from


RAW conflicts with main memory reads on cache misses:
➺ Write buffer holds updated data needed for the read.
➺ Alt #1 – wait for the write buffer to empty, increasing read miss
penalty (in old MIPS 1000 by 50% ).
➺ Alt #2 – Check write buffer contents before a read; if no
conflicts, let the memory read go first.

B.3 Six Basic Cache Optimizations 15


Optimization 5 – Giving Priority to Read
Misses over Writes
➺ Reduce miss penalty. See example below.

➺ In a write-back cache, suppose a read miss causes a dirty


block to be replaced
➺ Alt #1 – write the dirty block to memory first, then read
memory.
➺ Alt #2 – copy the dirty block to a write buffer, read the new
block, then write the dirty to memory.

B.3 Six Basic Cache Optimizations 16


Optimization 6 – Avoid Address Translation
during Cache Indexing to Reduce Hit Time

CPU Virtual Address


word
Cache
block
Memory Physical Address
page

Hard Disk

B.3 Six Basic Cache Optimizations 17


Optimization 6 – Avoid Address Translation
during Cache Indexing to Reduce Hit Time

(Figure: three ways of combining address translation with cache access)
➺ Conventional organization: the CPU issues a virtual address (VA); the TLB translates it to a physical address (PA), which then accesses the cache and memory
➺ Virtually addressed cache: the cache is indexed and tagged with the VA; translation through the TLB happens only on the way to memory
➺ Virtually addressed, physically tagged: the page offset (unchanged by translation) indexes the L1 cache while the TLB translates the VA in parallel; the PA tag is then compared to detect a hit; L2 and memory use PAs
B.3 Six Basic Cache Optimizations 18
Optimization 6 – Avoid Address Translation
during Cache Indexing to Reduce Hit Time

• Cache size limited by


the page size.
• Increase size by higher
associativity.

B.3 Six Basic Cache Optimizations 19


Advanced Cache Optimizations

20
Basic Ideas
➺ Reduce hit time, miss rate, and miss penalty
➺ We have seen some optimizations

➺ Increase cache bandwidth


➺ Bandwidth improves faster than latency

➺ Reduce miss rate/penalty via parallelism


➺ Exploit parallelism

2.3 Advanced Cache Optimizations 21


Opt 1 – Small and Simple L1 Caches
➺ Critical timing path in cache hit:
1. addressing tag memory, then
2. comparing tags, then
3. selecting correct block

➺ Direct-mapped caches can overlap tag comparison and


transmission of data
➺ Lower associativity reduces power because fewer cache
lines are accessed.
➺ Higher associativity
➺ Increases size of virtually indexed cache
➺ Reduces conflict misses due to multithreading

2.3 Advanced Cache Optimizations 22


Opt 1 – L1 Size and Associativity

➺ Access time increases as size & associativity increases


➺ As size increases, associativity other than 8 leads to smaller
differences in access time.
2.3 Advanced Cache Optimizations 23
Opt 1 – L1 Size and Associativity

Energy per read increases as size & associativity increases


2.3 Advanced Cache Optimizations 24
Opt 2 – Way Prediction

(Figure: set-associative cache with tag and data arrays; a multiplexer selects the data from the matching way)

Way prediction sets the way-select multiplexer before the tag comparison is done.


2.3 Advanced Cache Optimizations 25
Opt 2 – Way Prediction

(Figure: a 4-way set-associative cache with a 2-bit way predictor per set.
(1) the index selects a set and its 2-bit prediction; (2) the predicted way's data is forwarded while the four tag comparisons proceed; (3) if the selected tag does not match, the prediction is wrong and the other ways must be checked.)

2.3 Advanced Cache Optimizations 26


Opt 2 – Way Prediction
➺ Low order tag bits → prediction
➺ Predicts the next block to be accessed – what locality?
➺ Multiplexer could be set early to select the predicted block,
only a single tag comparison
➺ A mis-prediction results in checking the other blocks
➺ Prediction accuracy could be:
➺ > 90% for two-way
➺ > 80% for four-way
➺ I-cache has better accuracy than D-cache
➺ First used on MIPS R10000 in mid-90s
➺ Extend to decide which block to access using way selection
➺ Intends to save power consumption.
➺ Increases mis-prediction penalty

2.3 Advanced Cache Optimizations 27


Opt 2 – Way Prediction
➺ Performance evaluation
➺ HT1 : hit time when prediction is good
➺ HT2 : hit time when prediction is wrong
➺ MR : miss rate
➺ MP : miss penalty
➺ X : prediction accuracy

AMAT = (HT1 ⋅ X + (1 − X) ⋅ HT2) + MR ⋅ MP

2.3 Advanced Cache Optimizations 28


Opt 2 – Way Prediction Example
➺ Miss rate = 0.1,
➺ Miss penalty = 50 ns
➺ HT = 3 ns without way prediction
➺ HT1 = 2.5 ns when prediction is good
➺ HT2 = 3.5 ns when prediction is wrong
➺ Prediction accuracy X = 0.8
➺ ==============================================
➺ AMAT_nw = 3 + 0.1*50 = 8 ns (no way prediction)
➺ AMAT_W = (2.5*0.8 + 3.5*0.2) + 0.1*50 = 7.7 ns
➺ Speedup = 8/7.7 = 1.04
➺ Speedup on HT = 3/2.7 = 1.11
2.3 Advanced Cache Optimizations 29
Opt 3 – Pipelining Cache
➺ Increase bandwidth
➺ Pipeline cache access to improve bandwidth
➺ Trend: shorter clock cycle, but hit time relatively stable
➤ Cache hit handled in multiple cycles
➺ Examples – hit time in cycles
➤ Pentium: 1 cycle
➤ Pentium Pro – Pentium III: 2 cycles
➤ Pentium 4 – Core i7: 4 cycles

➺ Makes it easier to increase associativity


➺ In associative cache, tag compare and data output are serialized

➺ Increases branch mis-prediction penalty


➺ Cache pipeline needs to be flushed in case of mis-prediction

2.3 Advanced Cache Optimizations 30


Opt 3 – Multibanked Caches
➺ Organize cache as independent banks to support
simultaneous access
➺ ARM Cortex-A8 supports 1-4 banks for L2
➺ Intel i7 supports 4 banks for L1 and 8 banks for L2

➺ Interleave banks according to block address


➺ Mapping to banks: bank index = block address mod #banks

Figure 2.10 Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these addresses would
be multiplied by 64 to get byte addressing.

2.3 Advanced Cache Optimizations 31


Opt 4 – Nonblocking Caches
➺ Allow data cache to service hits during a miss
➺ Reduces effective miss penalty
➺ “Hit under miss”
● Extended to “Hit under multiple miss”
➺ L2 often support this

better

2.3 Advanced Cache Optimizations 32


Opt 5 – Critical Word First, Early Restart
➺ Processor needs one word in a block
➺ Critical word first
➺ Request missed word from memory first
➺ Send it to the processor as soon as it arrives

➺ Early restart
➺ Fetch the block in normal order
➺ Send missed word to the processor as soon as it arrives

➺ Effectiveness of these strategies depends on block size


and likelihood of another access to the portion of the
block that has not yet been fetched
➺ More benefits if block size is larger

2.3 Advanced Cache Optimizations 33


Opt 6 – Merging Write Buffer
➺ When storing to a block that is already pending in the write buffer,
update write buffer
➺ Reduces stalls due to full write buffer
➺ Do not apply to I/O addresses

w/o write
merging

w write
merging

2.3 Advanced Cache Optimizations 34


Opt 7 – Compiler Optimizations
➺ Gap between CPU and memory requires SW developer
to look at memory hierarchy

/* before */
for (j = 0; j < 16; j++)
  for (i = 0; i < 16; i++)
    x[i][j] = 2*x[i][j];

Memory layout (row-major): 0: x[0][0], 1: x[0][1], ..., 15: x[0][15], 16: x[1][0], ...

2.3 Advanced Cache Optimizations 35


Opt 7 – Compiler Optimizations
➺ Loop Interchange
➺ Swap nested loops to access memory in sequential order
➺ Expose spatial locality

/* after */
for (i = 0; i < 16; i++)
  for (j = 0; j < 16; j++)
    x[i][j] = 2*x[i][j];

Memory layout (row-major): 0: x[0][0], 1: x[0][1], ..., 15: x[0][15], 16: x[1][0], ...

2.3 Advanced Cache Optimizations 36


Opt 7 – Compiler Optimizations
➺ Blocking
➺ Instead of accessing entire rows or columns, subdivide matrices
into blocks
➺ Requires more memory accesses but improves temporal
locality of accesses
➺ See the book for an example

2.3 Advanced Cache Optimizations 37
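The textbook's blocking example is a blocked (tiled) matrix multiply. Below is a hedged C sketch of that idea; N and the tile size BS are illustrative values, with BS chosen so that roughly three BS×BS tiles fit in the cache at once.

#include <stdio.h>

#define N  64       /* matrix dimension (illustrative)                  */
#define BS 16       /* tile size; pick it so ~3 tiles fit in the cache  */

double x[N][N], y[N][N], z[N][N];

void blocked_matmul(void) {
    for (int jj = 0; jj < N; jj += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + BS; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + BS; k++)
                        r += y[i][k] * z[k][j];  /* reuse the y and z tiles     */
                    x[i][j] += r;                /* while they are still cached */
                }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            y[i][j] = 1.0; z[i][j] = 1.0; x[i][j] = 0.0;
        }
    blocked_matmul();
    printf("x[0][0] = %.0f (expected %d)\n", x[0][0], N);
    return 0;
}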


Opt 8 – Hardware Prefetching
➺ Fetch two blocks on miss (include next sequential block)
➺ Can hurt power if prefetched data are not used.

better

Some results obtained on Pentium 4 w. Pre-fetching


2.3 Advanced Cache Optimizations 38
Cache - Summary
➺ A small and fast buffer between CPU and main memory
➺ Direct mapped cache
➺ shorter hit time, higher miss rate, lower power consumption

➺ Associative cache
➺ longer hit time, lower miss rate, higher power consumption

➺ Performance evaluation
➺ AMAT – average memory access time: maybe misleading
➺ CPI with stall cycles due to cache misses: more accurate

➺ Various optimization to improve performance

2.3 Advanced Cache Optimizations 39


40
Example: Intel Core I7 Cache


➺ L1 I$ - 32 KB, L1 D$ - 32KB 8-way set associative, private


➺ L2 – 256 KB, 8-way set associative, private
➺ L3 – 8 MB, 16-way set associative, shared
41
Virtual Memory
A Simple View of Memory Hierarchy

CPU
word
Cache
block
Memory
page

Hard Disk

B.3 Six Basic Cache Optimizations 2


Why Virtual Memory
➺ Limited physical memory leads to complications
➺ Size of a program is larger than main memory size
➺ Memory demand of multiple programs is larger than
main memory size

➺ Observation: a program often needs only a small part
of its memory during its execution
➺ Load what is needed into main memory!

B.4 Virtual Memory 3


Why Virtual Memory
➺ Before virtual memory, programs were divided into pieces (overlays),
and programmers identified pieces that were mutually exclusive
➺ These pieces were loaded and unloaded under
user program control during execution
➺ Calls between procedures in different modules
lead to overlaying of one module with the other
➺ Used to be done by hand
➺ Significant burden on programmers

B.4 Virtual Memory 4


Basics of Virtual Memory
➺ Programs use virtual addresses (VA)
➺ Memory uses physical addresses (PA)
➺ VA → PA at the page granularity
➺ pages can be anywhere in physical memory
➺ or in disk

(Figure: the virtual address spaces of the OS, App1, and App2 map page-by-page into physical memory or onto disk)

B.4 Virtual Memory 5


Basics of Virtual Memory
➺ Use physical DRAM as cache for disk
➺ Address space of a process can exceed physical memory size
➺ Sum of address spaces of multiple processes can exceed physical
memory
➺ Simplify memory management
➺ Multiple processes resident in main memory

Each process with its own address space
➺ Only “active” code and data is actually in memory
➤ Allocate more memory to process as needed

➺ Provide protection
➺ One process can’t interfere with another

Because they operate in different address spaces
➺ User process cannot access privileged information
➤ Different sections of address space have different permissions
B.4 Virtual Memory 6
A Simplified View of Virtual Memory

(Figure: the CPU issues virtual addresses 0..N-1; a page table maps each virtual page either to a physical page in memory (physical addresses 0..P-1) or to a location on disk)
B.4 Virtual Memory 7
Cache vs Virtual Memory

B.4 Virtual Memory


Cache vs Virtual Memory
➺ Terminology
➺ Block → page
➺ Cache miss → page fault
➺ Replacement on cache memory misses by
hardware whereas virtual memory replacement is
by OS
➺ Size of processor address determines size of VM
whereas cache size is independent of address size
➺ VM can have fixed or variable size blocks
➺ page vs segmentation: they both have pros and cons

B.4 Virtual Memory 9


Virtual Memory Design Issues
➺ Page size should be large enough to try to
amortize high access time of disk
➺ Typical size: 4KB – 16KB
➺ Reducing page fault rate is important
➺ Fully associative placement of pages in memory
➺ Page faults not handled by hardware
➺ OS can afford to use clever algorithm for page
replacement to reduce page fault rate
➺ Write-through approach is too expensive
➺ Write back approach is always used

B.4 Virtual Memory 10


Virtual Memory Address Translation

From CPU

n–1 p p–1 0
virtual address
virtual page number page offset
[n-1..0]

Address Translation

m–1 p p–1 0
physical address
physical page number page offset
[m-1..0]

To memory
B.4 Virtual Memory 11
Address Translation
page table base register virtual address
n–1 p p–1 0
virtual page number (VPN) page offset

valid access physical page number (PPN)

if valid = 0 then the page is not in memory

m–1 p p–1 0
physical page number (PPN) page offset

physical address

• Each process has its own page table. 12


B.4 Virtual Memory
Address Translation - Example
Page Table base register
0xFFFF87F8
Virtual Page Number Page Offset
1111 1111 1010 1000 1010 1111 1101 1100

0 …

1111 1010 1111

1111 1111 1111 1111 …


page table

Physical Address: 1111 1010 1111 1010 1111 1101 1100


Physical Page Page Offset
Number

B.4 Virtual Memory 13
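A minimal C sketch of single-level translation (illustrative; it assumes 4 KiB pages, a 32-bit virtual address, and a hypothetical mapping, and it ignores protection bits). Note that the flat table already needs 2^20 entries, one motivation for the multi-level tables discussed later.

#include <stdio.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12                     /* 4 KiB pages              */
#define VPN_BITS         (32 - PAGE_OFFSET_BITS)

typedef struct {
    unsigned valid : 1;
    uint32_t ppn;                               /* physical page number     */
} PTE;

static PTE page_table[1u << VPN_BITS];          /* flat table: 2^20 entries */

/* returns 1 and fills *pa on success, 0 on a page fault */
static int translate(uint32_t va, uint64_t *pa) {
    uint32_t vpn    = va >> PAGE_OFFSET_BITS;
    uint32_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1);
    if (!page_table[vpn].valid)
        return 0;                 /* page fault: OS must bring the page in  */
    *pa = ((uint64_t)page_table[vpn].ppn << PAGE_OFFSET_BITS) | offset;
    return 1;
}

int main(void) {
    /* hypothetical mapping: virtual page 0x12345 -> physical page 0x00678 */
    page_table[0x12345].valid = 1;
    page_table[0x12345].ppn   = 0x00678;

    uint64_t pa;
    if (translate(0x12345ABCu, &pa))
        printf("VA 0x12345ABC -> PA 0x%llx\n", (unsigned long long)pa);
    else
        printf("page fault\n");
    return 0;
}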


Review the cache optimization:
Avoid Address Translation during Cache Indexing to Reduce Hit Time
14
Page Faults
➺ What if object is on disk rather than in memory?
➺ Page table entry indicates virtual address not in memory
➺ OS exception handler invoked to move data from disk into
memory
➺ Current process suspends, others can resume
➤ OS has full control over placement, etc.

(Figure: before the fault, the page table entry for the virtual address points to disk; after the fault, the OS has copied the page into memory and updated the entry to point to the new physical address)

B.4 Virtual Memory 15
Servicing Page Faults
(1) Initiate block read
➺ Processor signals controller
➺ Read block of length P starting at disk address X and store starting at memory address Y
(2) Read occurs
➺ Direct Memory Access (DMA)
➺ Under control of I/O controller
(3) Read done
➺ I/O controller signals completion
➺ Interrupt processor
➺ OS resumes suspended process
(Figure: processor/cache, memory, and I/O (disk) controller connected by the memory-I/O bus)

B.4 Virtual Memory 16


Page Replacement
➺ When there are no available free pages to handle a fault,
we must find a page to replace.
➺ This is determined by the page replacement algorithm.

➺ The goal of the replacement algorithm is to reduce the


fault rate by selecting the best victim page to remove.
➺ Again, due to high access latency to disk

➺ LRU is the most popular replacement policy.


➺ Each page is associated with a use/reference bit.
➺ It is set whenever a page is accessed.
➺ OS periodically clears these bits.
➺ Pages with the use bit reset are candidates for replacement

B.4 Virtual Memory 17


Fast Address Translation – TLB

tags from VAs

Physical page number

B.4 Virtual Memory 18


TLB in AMD Opteron

Figure B.24 Operation of the Opteron data TLB during address translation. The four steps of a TLB hit are shown
as circled numbers. This TLB has 40 entries. Section B.5 describes the various protection and access fields of an
Opteron page table entry.

B.4 Virtual Memory 19


Question: Page size and Page Table Size
➺ Assume 32-bit virtual address
➺ Page table size if page size is 4K?
➺ Page table size if page size is 16K?

➺ What about 64-bit virtual address?

B.4 Virtual Memory 20
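A quick C sketch (illustrative) of the page-table-size arithmetic asked above; the 4-byte and 8-byte PTE sizes are assumptions, not given on the slide.

#include <stdio.h>
#include <math.h>

static void table_size(int va_bits, double page_bytes, double pte_bytes) {
    double entries = pow(2.0, va_bits) / page_bytes;    /* one PTE per virtual page */
    printf("%2d-bit VA, %3.0f KiB pages: %.3g entries, %.3g MB of PTEs\n",
           va_bits, page_bytes / 1024.0, entries,
           entries * pte_bytes / (1024.0 * 1024.0));
}

int main(void) {
    table_size(32,  4 * 1024, 4);   /* 2^20 entries ->  4 MB                      */
    table_size(32, 16 * 1024, 4);   /* 2^18 entries ->  1 MB                      */
    table_size(64,  4 * 1024, 8);   /* impractically large -> multi-level tables  */
    return 0;
}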


Multi-Level Page Table
Virtual address

• Higher level PTEs point to lower level tables


• PTEs at bottom level hold PPN
• Different parts of VAs used to index tables
at different levels

21
Multi-Level Page Table
Virtual address

• Large VA space is unused


• If an entry in level L is null, the tables at
lower levels (those to the right) are not
stored in memory

22
Selecting Page Size – Large Page Size
➺ Pros
➺ Smaller page table
➤ A page table can occupy a lot of space
➺ Larger (virtually indexed) L1 cache
➺ Fewer TLB misses, faster translation
➺ More efficient to transfer larger pages from 2nd storage
➺ Cons
➺ Wasted memory usage (internal fragmentation)
➺ Wasted IO bandwidth
➺ Slower process start up

B.4 Virtual Memory 23


Virtual Memory – Protection
➺ Virtual memory and multiprogramming
➺ Multiple processes sharing processor and physical memory

➺ Protection via virtual memory


➺ Keeps processes in their own memory space
➺ only OS can update page tables

➺ Role of architecture:
➺ Provide user mode and supervisor mode
➺ Protect certain aspects of CPU state: PC, register, etc
➺ Provide mechanisms for switching between user mode and
supervisor mode
➺ Provide mechanisms to limit memory accesses
➺ Provide TLB to translate addresses

B.5 Protection of Virtual Memory 24


Page Table Entries – Protection
➺ Address – physical page number
➺ Valid/present bit
➺ Modified/dirty bit
➺ Reference bit
➺ For LRU

● Protection bits – Access right field defining allowable


accesses
→ R/W/X: read only, read-write, execute only
→ User/Supervisor

25
Page Table Entries – Protection
➺ Check bits on each access and during a page fault
➺ If violated, generate exception (Access Protection exception)

26
Summary
➺ OS virtualizes memory and IO devices
➺ Each process has an illusion of private CPU and
memory
➺ Virtual memory
➺ Arbitrarily large memory, isolation/protection, inter-
process communication
➺ Reduce page table size
➺ Translation buffers – cache for page table
➺ Manage TLB misses and page faults
➺ Protection

27
