Instruction Sets - Appendix B
“Instruction set architecture is the structure of a computer that a
machine language programmer (or a compiler) must understand
to write a correct (timing independent) program for that machine”
- IBM introducing 360 in 1964
Instruction set aspects
• operands
• memory issues
• operations (mostly control)
Compilers (paper #4)
READ papers #4-7, Appendix B

Operand Storage
Why in the processor?
• faster access
• shorter address
Accumulator
+ less hardware
– high memory traffic
– likely bottleneck
© 2009 by Vijaykumar ECE565 Lecture Notes: Appendix B
Operand Storage
Stack - LIFO (60’s - 70’s)
+ code density (top of stack implicit)
– bottleneck while pipelining (why?)
• note: JAVA VM stack-based
Registers - 8 to 256 words
+ flexible: temporaries and variables
– registers must be named
– code density and “second” name space

Memory vs. Registers
Registers
+ faster (no addressing modes, no tags)
+ deterministic (no misses)
+ can replicate for more ports
+ short identifier
– must save/restore on procedure calls
– cannot take address of a register (distinct from memory)
– fixed size (FP, strings, structures)
– compilers must manage (an advantage?)
Registers vs. Memory
How many registers? more =>
+ hold operands longer (reducing memory traffic + run time)
– longer register specifiers (except with register windows)
– slower registers
– more state slows context switches

Operands for ALU Instructions
ALU instructions combine operands
Number of explicit operands
• two - ri := ri op rj
• three - ri := rj op rk
operands in registers or memory
• any combo - VAX - orthogonal but variable length instrs
• at least one register - IBM 360/370 - not orthogonal
• all registers - Cray, RISCs - orthogonal but loads/stores
Operands for DSP
Integer and floating point operands are common in general-purpose processors
DSPs have fixed point
• used to represent a vertex
• the binary point is just to the right of the sign bit
• 0100 0000 0000 0000 = 2^-1
• Fixed point is poor-man’s floating point
• without exponent or h/w normalization as in FP

Endian Wars
Order of bytes in words
• Big endian - MSB at address xxxx00
• Little endian - MSB at address xxxx11
Big endian - IBM, Motorola, SPARC
Little endian - DEC, Intel
Mode selectable
• common today
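The byte orders on the Endian Wars slide can be checked with a short Python sketch. The value 0x0A0B0C0D and the helper name show are illustrative assumptions, not part of the original notes; struct.pack with the ">" and "<" prefixes is the standard-library way to force big- or little-endian layout.

```python
import struct

value = 0x0A0B0C0D  # 32-bit word; 0x0A is the most significant byte (MSB)

big = struct.pack(">I", value)     # big endian: MSB at the lowest address
little = struct.pack("<I", value)  # little endian: MSB at the highest address

def show(label, data):
    # Print each byte next to its offset within the word.
    cells = ", ".join(f"+{i}: {b:02X}" for i, b in enumerate(data))
    print(f"{label:<7} {cells}")

show("big", big)
show("little", little)
```

The MSB (0x0A) sits at offset 0 (address xxxx00) in the big-endian layout and at offset 3 (address xxxx11) in the little-endian one.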
Operand Alignment
What is alignment?
• address mod size = 0
• natural boundaries
e.g., aligned word (4 bytes)
• 10 11 12 13 20
• d0 d1 d2 d3 -
e.g., unaligned word (4 bytes)
• 10 11 12 13 20
• -  d0 d1 d2 d3

Alignment
No restrictions
• simpler software
• hardware must detect misalignment
• and make 2 memory accesses
• expensive logic, slows down all references (why?)
• sometimes required for backward compatibility
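The alignment rule (address mod size = 0) and the cost of a misaligned access can be sketched as follows. The function names and the 4-byte memory width are assumptions made for illustration; real hardware widths vary.

```python
def is_aligned(address, size):
    # An operand is naturally aligned when its address is a multiple of its size.
    return address % size == 0

def accesses_needed(address, size, memory_width=4):
    # Count how many memory_width-byte blocks the operand touches.
    first_block = address // memory_width
    last_block = (address + size - 1) // memory_width
    return last_block - first_block + 1

for addr in (16, 17):
    print(addr, is_aligned(addr, 4), accesses_needed(addr, 4))
```

Under these assumptions an aligned 4-byte word needs one access, while the same word shifted by one byte straddles two blocks and needs two, which is the extra work the "no restrictions" hardware has to absorb.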
Alignment
Restricted alignment
• software must guarantee alignment
• hardware only detects misalignment and traps
• trap handler does it
Middle ground
• misaligned data ok but requires multiple instructions
• compiler must still know
• still trap on misaligned access

Addressing Modes
• register: Ri
• displacement: M[Ri + #n]
• immediate: #n
• register indirect: M[Ri]
• indexed: M[Ri + Rj]
• absolute: M[#n]
• memory indirect: M[M[Ri]]
• auto-increment: M[Ri]; Ri += d
• auto-decrement: Ri -= d; M[Ri]
• scaled: M[Ri + #n + Rj * d]
• update: M[Ri = Ri + #n]
Top 4 modes cover 93% of all VAX operands [Clark and Emer]
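A few of the addressing modes listed on this slide can be modeled with a toy register file and memory. The register names, memory contents, and function names below are made up for illustration; the sketch only mirrors the M[...] notation used in the notes.

```python
# Toy machine state (hypothetical values chosen for the example).
regs = {"R1": 100, "R2": 8}
mem = {100: 500, 108: 7, 116: 9, 500: 42}

def displacement(ri, n):
    return mem[regs[ri] + n]                 # M[Ri + #n]

def register_indirect(ri):
    return mem[regs[ri]]                     # M[Ri]

def indexed(ri, rj):
    return mem[regs[ri] + regs[rj]]          # M[Ri + Rj]

def memory_indirect(ri):
    return mem[mem[regs[ri]]]                # M[M[Ri]]

def scaled(ri, n, rj, d):
    return mem[regs[ri] + n + regs[rj] * d]  # M[Ri + #n + Rj * d]

def auto_increment(ri, d):
    value = mem[regs[ri]]                    # M[Ri]
    regs[ri] += d                            # then Ri += d
    return value

print(displacement("R1", 8), memory_indirect("R1"), scaled("R1", 0, "R2", 2))
```

Each function is one effective-address computation followed by a single memory read, except memory indirect, which reads memory twice.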
DSP Addressing modes
Eg 1 - modulo or circular
• because DSPs deal with streams of data they use circular buffers
• this mode naturally implements such buffers
Eg 2 - bit reverse
• specifically for FFT - a common DSP operation
• FFT shuffles data in a particular order, this mode does the shuffle
• 000 -> 000, 001 -> 100, 010 -> 010, 011 -> 110, 100 -> 001
These modes are hard for compilers; linking to assembly-level
libraries allows use by programmers

Operations
Arithmetic and logical - and, add
Data transfer - move, load
Control - branch, jump, call
System - system call, traps
Floating point - add, mul, div, sqrt
Decimal - addd, convert
String - move, compare
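The two DSP addressing modes (modulo/circular and bit reverse) amount to small index computations that DSP hardware performs as part of the address calculation. A minimal software sketch, with function names and buffer parameters chosen only for illustration:

```python
def bit_reverse(index, bits):
    # Reverse the low `bits` bits of index, as FFT data reordering requires.
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def circular_next(pointer, step, base, length):
    # Advance a pointer inside the buffer [base, base + length), wrapping at the end.
    return base + (pointer - base + step) % length

print([format(bit_reverse(i, 3), "03b") for i in range(8)])
print(circular_next(23, 2, base=16, length=8))
```

For 3-bit indices the first function maps 001 to 100 and 011 to 110 and leaves 000, 010, 101 and 111 unchanged; the second wraps a pointer that runs off the end of the buffer back to its start.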
DSP Operations
DSP has many 8-bit, 16-bit, 32-bit operands
Instead of wasting 64-bit ALUs and datapath for narrow operands
• DSPs allow 4 16-bit operations in parallel in 64-bit datapath
Other peculiarities:
• in real time no time for exceptions => instead of excepting
on overflow they use saturating arithmetic => if result is too
small or large hardware “saturates” to the smallest or
largest representable number

DSP Operations
• special rounding modes like IEEE
• multiply-accumulate instruction because accumulating
series of products is common
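The three DSP operations described here (saturating arithmetic, four 16-bit operations packed into a 64-bit datapath, and multiply-accumulate) can be mimicked in software. This is only a sketch; the function names are hypothetical and the packed add uses wraparound per lane, which is one of several behaviors real DSPs offer.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def saturating_add16(a, b):
    # Clamp to the 16-bit signed range instead of raising an overflow exception.
    return max(INT16_MIN, min(INT16_MAX, a + b))

def packed_add16(x, y):
    # Add four independent 16-bit lanes held in 64-bit words; carries do not
    # cross lane boundaries.
    result = 0
    for lane in range(4):
        shift = 16 * lane
        a = (x >> shift) & 0xFFFF
        b = (y >> shift) & 0xFFFF
        result |= ((a + b) & 0xFFFF) << shift
    return result

def multiply_accumulate(xs, ys):
    # Accumulate a series of products, the core of filters and dot products.
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

print(saturating_add16(30000, 10000))
print(hex(packed_add16(0x0001_0002_0003_FFFF, 0x0001_0001_0001_0001)))
print(multiply_accumulate([1, 2, 3], [4, 5, 6]))
```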
Control Instructions
Aspects
• 1. taken or not
• 2. where is the target
• 3. link return address
• 4. save or restore
Instructions that change the PC
• (conditional) branches [1-2], (unconditional) jumps [2]
• function calls [2,3,4], function returns [2,4]
• system calls [2,3,4], system returns [2,4]

Taken or Not?
Compare and branch
+ no extra compare, no state passed between instructions
– requires ALU op, restricts code scheduling opportunities
Implicitly set condition codes - Z, N, V, C
+ can be set “for free”
– constrains code reordering, extra state to save/restore
Explicitly set condition codes
+ can be set for free, decouples branch/fetch from pipeline
– extra state to save/restore
Taken or Not
Condition in general-purpose register
+ no special state but uses up a register
– branch condition separate from branch logic in pipeline
Some data for MIPS
• > 80% branches use immediate data, > 80% of those zero
• 50% branches use == 0 or <> 0
Compromise in MIPS
• branch==0, branch<>0
• compare instructions for all other compares

Where is the Target?
Arbitrary specifier
+ orthogonal
– more bits to specify, more time to decode
– branch execution and target separated in pipeline
PC relative with immediate
+ position independent, target computable in branch unit
+ short immediate sufficient - #bits: <4 (47%), <8 (94%)
– target must be known statically, can’t jump far
– other techniques needed for returns, distant jumps
Where is the Target
Register
+ short specifier, can jump anywhere, dynamic target ok (ret)
– extra instruction to load register
– branch and target separated in pipeline
Vectored trap - critical for OS calls
+ protection
– surprises cause implementation headache

KEY: Connection to pipelining
Control flow instructions affect which instruction is fetched next
Fetching occurs at the frontend of the pipeline
• fetch in frontend, and process in the backend
Register file is in the backend of the pipeline
If pipeline is deep (done to make each stage small and fast, and
hence clock speed higher, so there are many stages)
• frontend and backend far away from each other (in time)
• => if processing branch needs info from backend that will
be slow
Link Return Address
Implicit register - many recent architectures use this
+ fast, simple
– s/w save register before next call, surprise traps?
Explicit register
+ may avoid saving register
– register must be specified
Processor stack
+ recursion direct
– complex instructions

Save or Restore state?
What state?
• function calls: registers
• system calls: registers, flags, PC, PSW, etc
Hardware need not save registers
• caller can save registers in use
• callee save registers it will use
Hardware register save
• IBM STM, VAX CALLS
• faster?
Save or Restore State?
Many recent architectures do no register saving
Or do implicit register saving with register windows (SPARC)

Notation
Generic assembly code
• sub r1, r2, r3
• means r1 := r2 - r3
Data sizes
• byte 8 bits
• halfword 16 bits
• word 32 bits
• doubleword 64 bits
• quad word 128 bits
VAX
DEC 1977 VAX-11/780
Upward compatible from PDP-11
32-bit words and addresses
Virtual memory
16 GPRs (r15 PC r14 SP), CCs
Extremely orthogonal and memory-memory
Decode as byte stream - variable in length
• opcode: operation, #operands, operand types

VAX
Data types
• 8, 16, 32, 64, 128
• char string - 8 bits/char
• decimal - 4 bits/digit
• numeric string - 8 bits/digit
VAX
Addressing modes
• literal 6 bits
• 8, 16, 32 bit immediates
• register, register deferred
• 8, 16, 32 bit displacements
• 8, 16, 32 bit displacements deferred
• indexed (scaled)
• autoincrement, autodecrement
• autoincrement deferred

VAX
Operations
• data transfer including string move
• arithmetic and logical (2 and 3 operands)
• control (branch, jump, etc)
• AOBLEQ
• function calls save state
• bit manipulation
• floating point - add, sub, mul, div, polyf
• system - exception, VM
• other - crc (cyclic redundancy check), insque (insert in Q)
VAX
addl3 R1,737(R2),#456
byte 1: addl3
byte 2: mode, R1
byte 3: mode, R2
byte 4-5: 737
byte 6: mode
byte 7-10: 456
VAX has too many modes and formats
The big deal with RISC is not fewer instrs
• Fewer modes/formats => faster decoding in pipelining

8086
Intel in 1978
• chosen for IBM PC 1980
• remains most popular 16-bit architecture
• upward compatible with 8080
• complex - “difficult to explain and impossible to love”
• special purpose registers
• 4 arithmetic, 4 address, 4 segment, 2 control
• addresses - 16-bit segment << 4 + 16-bit offset
• 64K 16-byte-aligned 64KB segments
• many formats - see H&P
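The 8086 address formation (a 16-bit segment shifted left by 4 bits, plus a 16-bit offset) can be written out directly. The function name is illustrative; the mask models the 20-bit address bus of the 8086, on which the sum wraps around.

```python
def physical_address(segment, offset):
    # segment and offset are 16-bit values; the result is a 20-bit address.
    return ((segment << 4) + offset) & 0xFFFFF

# Two different segment:offset pairs can name the same byte, because
# segments start every 16 bytes and each spans 64KB.
print(hex(physical_address(0x1234, 0x0010)))
print(hex(physical_address(0x1235, 0x0000)))
```

Both calls yield 0x12350, which shows why the 64K possible segments overlap heavily.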
MIPS
• RISC
• 32-bit byte addresses aligned
• load/store - only displacement addressing
• standard datatypes
• 3 fixed length formats
• 32 32-bit GPRs (r0 = 0)
• 16 64-bit (32 32-bit) FPRs
• FP status register
• no CCs

MIPS
Data transfer
• load/store word, load/store byte/halfword signed?
• load/store FP single/double
• moves between GPRs and FPRs
ALU
• add/subtract signed? immediate?
• multiply/divide signed?
• and,or,xor immediate?, shifts: ll, rl, ra immediate?
• sets immediate?
MIPS
Control
• branches == 0, <> 0
• conditional branch testing FP bit
• jump, jump register
• jump & link, jump & link register
• trap, return-from-exception
FP
• add/sub/mul/div single/double
• fp converts, fp set

MIPS
Instruction formats (field widths in bits)
I-type: Opcode (6) | rs1 (5) | rd (5) | Immediate (16)
R-type: Opcode (6) | rs1 (5) | rs2 (5) | rd (5) | func (11)
J-type: Opcode (6) | Offset added to PC (26)
I format - ALU immediate, loads/stores, branches, jump register
R format - RRR ALU ops
J format - unconditional jumps (& link?)
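The three fixed-length MIPS formats on this slide (I-type 6/5/5/16, R-type 6/5/5/5/11, J-type 6/26) can be packed into 32-bit words with shifts and masks. The sketch follows the field order given in the notes; the function names and the opcode value used in the example are hypothetical.

```python
def encode_i_type(opcode, rs1, rd, immediate):
    # Opcode (6) | rs1 (5) | rd (5) | Immediate (16)
    return (opcode << 26) | (rs1 << 21) | (rd << 16) | (immediate & 0xFFFF)

def encode_r_type(opcode, rs1, rs2, rd, func):
    # Opcode (6) | rs1 (5) | rs2 (5) | rd (5) | func (11)
    return (opcode << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | (func & 0x7FF)

def encode_j_type(opcode, offset):
    # Opcode (6) | Offset added to PC (26)
    return (opcode << 26) | (offset & 0x3FFFFFF)

def decode_i_type(word):
    opcode = (word >> 26) & 0x3F
    rs1 = (word >> 21) & 0x1F
    rd = (word >> 16) & 0x1F
    immediate = word & 0xFFFF
    if immediate & 0x8000:       # sign-extend the 16-bit immediate
        immediate -= 0x10000
    return opcode, rs1, rd, immediate

word = encode_i_type(0x23, 2, 1, -4)   # 0x23 is an example opcode value
print(hex(word), decode_i_type(word))
```

Because every field sits at a fixed bit position, decoding is a handful of shifts and masks, which is the "fewer formats => faster decoding" point made for RISC.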
Compilers 101 (Wulf’s paper)
Wm. Wulf’s “compilers and architecture”
Compiler goals:
• all correct programs execute correctly
• most compiled programs execute fast (optimizations)
• fast compilation
• debugging support

Compilers 101
Phases to manage complexity
Parsing --> intermediate representation
Procedure inlining
Loop Optimizations
Common Sub-Expression
Jump Optimization
Constant Propagation
Register Allocation
Strength Reduction
Pipeline Scheduling
Code Generation --> assembly code
Compilers 101
Procedure inlining 10%
local optimization 5%
register allocation 21%
global + local 14%
global+local+reg-alloc 63%
everything 81%
local: common subexpression, constant propagation
global: common subexpression, loop invariant code motion

Compiler 101
what compiler writers want:
• regularity - similar structure across instructions
• orthogonality - across operation, data type, addressing
• composability - results from one directly to another
• regularity and orthogonality => composability
compilers perform a giant case analysis
• too many choices make it hard
orthogonal instruction sets
• operation, addressing mode, data type
Compiler 101
one solution or all possible solutions
• 2 branch conditions - eq, lt
• or all six - eq, ne, lt, gt, le, ge
• not 3 or 4
Primitives NOT solutions
“. . . by giving too much semantic content to the
instruction, the machine designer made it possible to use the
instruction only in limited contexts. In many cases the complex
instructions are synthesized from more primitive operations,
which if the compiler had access to, could be recomposed to
more closely model the feature actually needed.”

RISC vs. CISC (Clark&Bhandarkar)
Clark & Bhandarkar ASPLOS paper: VAX 8700 vs MIPS R3000
Combines 3 features
• architecture
• implementation
• compilers and OS
Argues that
• implementation effects are second order
• compilers are similar
• RISCs are better than CISCs: fair comparison?
RISC vs. CISC
Recall Iron Law: Time = #instructions x CPI x clock period
RISC factor: (CPI_VAX * Instr_VAX) / (CPI_MIPS * Instr_MIPS)

Benchmark   instruction ratio   CPI MIPS   CPI VAX   CPI ratio   RISC factor
li          1.6                 1.1        6.5       6.0         3.7
eqntott     1.1                 1.3        4.4       3.5         3.3
fpppp       2.9                 1.5        15.2      10.5        2.7
tomcatv     2.9                 2.1        17.5      8.2         2.9

RISC vs. CISC
Compensating factors
• increase VAX CPI but decrease VAX instruction count
• increase MIPS instruction count
• e.g. 1: loads/stores vs. operand specifiers
• e.g. 2: necessary complex instructions: loop branches
Factors favoring VAX
• big immediate values
• not-taken branches incur no delay
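The RISC factor follows from the Iron Law once the clock periods are set aside: it is the CPI ratio (VAX over MIPS) divided by the instruction-count ratio. The sketch below assumes the "instruction ratio" column means MIPS instructions per VAX instruction, which is the reading consistent with the li row; the function name is illustrative.

```python
def risc_factor(instruction_ratio, cpi_mips, cpi_vax):
    # (CPI_VAX * Instr_VAX) / (CPI_MIPS * Instr_MIPS), with
    # instruction_ratio taken to be Instr_MIPS / Instr_VAX (an assumption).
    cpi_ratio = cpi_vax / cpi_mips
    return cpi_ratio / instruction_ratio

# li row from the table: instruction ratio 1.6, CPI MIPS 1.1, CPI VAX 6.5
print(round(risc_factor(1.6, 1.1, 6.5), 1))
```

For li this reproduces the tabulated 3.7: MIPS executes about 1.6 times as many instructions, but each costs roughly one sixth as many cycles.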
RISC vs. CISC
Factors favoring MIPS
• operand specifier decoding
• number of registers
• separate floating point unit
• simple branches/jumps (lower latency)
• no complex instructions
• instruction scheduling
• translation buffer
• branch displacement size

Technology Scaling (Borkar’s paper)
Why is technology scaling so important?
What are the goals of scaling? (keep in mind that 0.7 = 1/sqrt(2))
• reduce gate delay by 30% => clock up by 43% (1/0.7)
• double transistor density
• reduce energy per switch by 65% => reduce power by 50%
Scaling theory:
• delay = 0.7 => frequency = 1.43
• width, length, thickness = 0.7 => area cap, fringe cap = 0.7
• total cap = 0.7, total area = 0.7^2 = 0.5
Scaling Trends
Clock frequency
• improves by factor of 2, not just 1.43, every generation
• 1.43 to 2 comes from circuits and microarchitecture
• mainly less work per clock => more pipeline stages =>
deeper pipes => branch and cache miss penalty worse
(making architects work harder to achieve performance)
Transistor density (#devices/area)
• doubles as expected if old microarchitecture is shrunk
• if new microarchitecture then density is not double
• more complexity and less time to optimize?

Scaling Trends
Interconnect
• width and thickness decrease with transistors
• interconnect distribution for new microarchitecture not
different than that of old => complexity is not reason for drop
in density
Power = f C V^2
• reduce by 50% => scale Vdd by 0.7 (constant field scaling)
• will not reduce if constant voltage scaling
• active capacitance density should increase by 43% but
only ~30% in reality due to lower logic density
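The power claims for one technology generation can be checked numerically from Power = f C V^2 with the scaling factors given in the notes (frequency 1/0.7, capacitance 0.7). The function and variable names are illustrative, and the model is the idealized scaling theory only.

```python
S = 0.7  # linear scaling factor per generation, about 1/sqrt(2)

def relative_power(freq, cap, vdd):
    # Dynamic power relative to the previous generation: f * C * V^2
    return freq * cap * vdd ** 2

constant_field = relative_power(1 / S, S, S)      # Vdd scaled by 0.7
constant_voltage = relative_power(1 / S, S, 1.0)  # Vdd left unchanged

print(round(constant_field, 2), round(constant_voltage, 2))
```

With Vdd scaled by 0.7 the product is about 0.49, the 50% reduction on the slide; with Vdd held constant the frequency gain cancels the capacitance reduction and power does not drop.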
Projections
Power will increase to 10KW if Vdd not scaled
• if Vdd is scaled it will be 2KW in 2010
• if die size is restricted it will be 100W for small die, 200W
for large die
Energy-delay trade-off
• why scale only delay? why not both energy and delay?
• then energy-delay is better metric than delay alone
• if Vdd is not scaled energy-delay reduces by 50%
• if Vdd is scaled then energy-delay reduces by 75%

Projections
If Vdd is scaled then threshold voltage Vt must scale (take
ECE559 if you do not understand this)
• but lower Vt => exponentially more leakage energy
• i.e., Vdd scaling reduces dynamic energy but
increases leakage energy!
• leakage increases roughly 5x every generation
• today leakage is 15%, soon it will equal dynamic power
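The energy-delay figures in the trade-off bullets (50% and 75%) can be reproduced under the scaling theory stated earlier in these notes: switching energy proportional to C V^2, with capacitance and delay both scaling by 0.7. The names below are illustrative and leakage is ignored, so this is a sketch of the dynamic-energy argument only.

```python
S = 0.7  # per-generation scaling factor for capacitance and delay

def energy_delay(vdd_scale):
    energy = S * vdd_scale ** 2  # switching energy ~ C * V^2
    delay = S                    # gate delay scales by 0.7
    return energy * delay

unscaled_vdd = energy_delay(1.0)
scaled_vdd = energy_delay(S)

print(round(1 - unscaled_vdd, 2), round(1 - scaled_vdd, 2))
```

The energy term alone with scaled Vdd is 0.7 * 0.49, about 0.34, which matches the "reduce energy per switch by 65%" scaling goal.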
Projections
• Vt decreases => noise margins reduce
• #logic transistors increase => soft error rate increases
• soft error means a bit flip due to a neutron strike
• Power density (power/area) increases
• we are already past kitchen hot plate!
I have research projects on all these topics (not a coincidence!)
• leakage, power, power density, soft errors, noise . . . .

Challenges for Next Decade (Agerwala)
Application pull vs. technology push
• standard recipe for innovation in architecture
Apps
• aircraft design, electromagnetics simulation, entertainment
• all of these require high compute power
Technology
• power - dynamic and static
Challenges for Next Decade
Optimal power-performance pipelines are shallower than that for
pure performance
• because twice the speed comes at eight times power
• pipeline depth fundamentally determines power
• deep pipeline + lower Vdd for power => poorer performance
• than shallow pipeline + high Vdd for performance

Challenges for Next Decade
Options
• Chip multiprocessors (CMPs) alleviate power problem
• why will this reduce power?
• but software has to be able to take advantage of
CMPs
• Special accelerators
• special hardware for TCP/IP, security
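One way to see both "twice the speed comes at eight times power" and why CMPs can reduce power is a cubic model: power is proportional to f V^2, and if Vdd must rise in proportion to frequency, power grows as f^3. The model and the function names are assumptions made for illustration, and the CMP comparison further assumes a perfectly parallel workload.

```python
def core_power(relative_speed):
    # P ~ f * V^2 with V assumed proportional to f, so P ~ f^3.
    return relative_speed ** 3

def cmp_power(cores, relative_speed):
    # Total power of `cores` identical cores, each at relative_speed.
    return cores * core_power(relative_speed)

print(core_power(2))      # one core at twice the speed
print(cmp_power(2, 0.5))  # two cores at half speed: same total throughput
```

Under these assumptions, two half-speed cores deliver the throughput of one full-speed core at a quarter of its power, provided the software can keep both cores busy.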
Challenges for Next Decade

Options (cont’d):
• scale out
• cluster of low-cost computers working as one large
computer
• system-wide power management
• compiler, operating system
• turn off things not used, scale down Vdd to required level
because not EVERYTHING needs to run at highest speed
ALWAYS