Instruction Sets - Appendix B

“Instruction set architecture is the structure of a computer that a
machine language programmer (or a compiler) must understand
to write a correct (timing independent) program for that machine”
- IBM introducing 360 in 1964
Instruction set aspects
• operands
• memory issues
• operations (mostly control)
Compilers (paper #4)

Operand Storage
Why in the processor?
• faster access
• shorter address
Accumulator
+ less hardware
– high memory traffic
– likely bottleneck
© 2009 by Vijaykumar ECE565 Lecture Notes: Appendix B
– registers must be named
– code density and “second” name space
– compilers must manage (an advantage?)

– cannot take address of a register (distinct from memory)
– fixed size (FP, strings, structures)
Registers vs. Memory
How many registers? more =>
+ hold operands longer (reducing memory traffic + run time)
– longer register specifiers (except with register windows)
– slower registers
– more state slows context switches

Operands for ALU Instructions
ALU instructions combine operands
Number of explicit operands
• two - ri := ri op rj
• three - ri := rj op rk
Operands in registers or memory
• any combo - VAX - orthogonal but variable length instrs
• at least one register - IBM 360/370 - not orthogonal
• all registers - Cray, RISCs - orthogonal but loads/stores
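
The two- and three-operand forms above can be sketched on a toy register file. This is only an illustration: the register file is modeled as a Python dict, and the helper names `three_operand` and `two_operand` are hypothetical, not taken from any real ISA.

```python
import operator

def three_operand(regs, ri, rj, rk, op):
    """ri := rj op rk (both source registers keep their values)."""
    regs[ri] = op(regs[rj], regs[rk])

def two_operand(regs, ri, rj, op):
    """ri := ri op rj (the first source is overwritten by the result)."""
    regs[ri] = op(regs[ri], regs[rj])

# Three-operand form: one instruction, sources untouched.
regs = {"r1": 0, "r2": 5, "r3": 7}
three_operand(regs, "r1", "r2", "r3", operator.add)

# Two-operand form: an extra register move is needed to preserve r2.
regs2 = {"r1": 0, "r2": 5, "r3": 7}
regs2["r1"] = regs2["r2"]                      # move r1, r2
two_operand(regs2, "r1", "r3", operator.add)   # add r1, r3
```

Both sequences leave the sum in r1; the two-operand version spends one more instruction, which is the usual trade against its shorter encoding.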
Operand Alignment
What is alignment?
• address mod size = 0
• natural boundaries

Alignment
No restrictions
• simpler software
• hardware must detect misalignment
DSP Addressing modes
Eg 1 - modulo or circular
• because DSPs deal with streams of data they use circular buffers
• this mode naturally implements such buffers
Eg 2 - bit reverse
• specifically for FFT - a common DSP operation
• FFT shuffles data in a particular order, this mode does that reordering
• 000 -> 000, 001 -> 100, 010 -> 010, 011 -> 110, 100 -> 001
These modes are hard for compilers; linking to assembly-level
libraries allows use by programmers

Operations
Arithmetic and logical - and, add
Data transfer - move, load
Control - branch, jump, call
System - system call, traps
Floating point - add, mul, div, sqrt
Decimal - addd, convert
String - move, compare
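
The two DSP addressing modes (modulo/circular and bit reverse) can be sketched in software as follows. This only mimics what the address-generation hardware computes; the function names are hypothetical.

```python
def circular_next(index: int, step: int, length: int) -> int:
    """Modulo addressing: advance an index and wrap around the buffer."""
    return (index + step) % length

def bit_reverse(value: int, bits: int) -> int:
    """Reverse the low `bits` bits of value (e.g. 001 -> 100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)  # shift in the lowest bit
        value >>= 1
    return result

# Index order an 8-point FFT would use with bit-reverse addressing.
fft_order = [bit_reverse(i, 3) for i in range(8)]
```

With a 4-entry buffer, stepping from index 3 wraps back to 0, so no explicit bounds check is needed in the inner loop.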
Other peculiarities:
• in real time there is no time for exceptions => instead of raising an
exception on overflow, DSPs use saturating arithmetic => if the result is
too small or too large, the hardware “saturates” to the smallest or
largest representable number
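
A minimal sketch of saturating addition for signed two's-complement values. The 16-bit default width is an assumption chosen for illustration, and `saturating_add` is a hypothetical name.

```python
def saturating_add(a: int, b: int, bits: int = 16) -> int:
    """Add two signed integers, clamping to the representable range."""
    largest = (1 << (bits - 1)) - 1    # e.g. 32767 for 16 bits
    smallest = -(1 << (bits - 1))      # e.g. -32768 for 16 bits
    return max(smallest, min(largest, a + b))

too_large = saturating_add(30000, 10000)     # clamps at the largest value
too_small = saturating_add(-30000, -10000)   # clamps at the smallest value
in_range = saturating_add(1, 2)              # ordinary result
```

Clamping keeps the result near the true value, whereas wraparound would flip its sign, which matters for signal data.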
Control Instructions
Aspects
• 1. taken or not
• 2. where is the target
• 3. link return address
• 4. save or restore
Instructions that change the PC
• (conditional) branches [1-2], (unconditional) jumps [2]
• function calls [2,3,4], function returns [2,4]
• system calls [2,3,4], system returns [2,4]

Taken or Not?
Compare and branch
+ no extra compare, no state passed between instructions
– requires ALU op, restricts code scheduling opportunities
Implicitly set condition codes - Z, N, V, C
+ can be set “for free”
– constrains code reordering, extra state to save/restore
Explicitly set condition codes
+ can be set for free, decouples branch/fetch from pipeline
– extra state to save/restore
Some data for MIPS
• > 80% branches use immediate data, > 80% of those zero
• 50% branches use == 0 or <> 0
Compromise in MIPS
• compare instructions for all other compares

– branch execution and target separated in pipeline
PC relative with immediate
+ position independent, target computable in branch unit
+ short immediate sufficient - #bits: <4 (47%), <8 (94%)
– other techniques needed for returns, distant jumps
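
A sketch of PC-relative target computation with a short signed immediate. It is deliberately simplified: real ISAs usually scale the offset by the instruction size and add it to the address of the next instruction, which is ignored here. All names are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def branch_target(pc: int, imm_field: int, imm_bits: int = 16) -> int:
    """Target = PC + sign-extended immediate (simplified, unscaled)."""
    return pc + sign_extend(imm_field, imm_bits)

def bits_needed(offset: int) -> int:
    """Smallest two's-complement width that can hold offset."""
    magnitude = offset if offset >= 0 else ~offset
    return magnitude.bit_length() + 1

# The same encoded offset works wherever the code is loaded.
back_4_low = branch_target(0x1000, 0xFFFC)
back_4_high = branch_target(0x8000, 0xFFFC)
```

Because only the distance is encoded, moving the code changes both the PC and the target by the same amount, which is what makes the branch position independent.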
Where is the Target
Register
+ short specifier, can jump anywhere, dynamic target ok (ret)
– extra instruction to load register
– branch and target separated in pipeline
Vectored trap - critical for OS calls
+ protection
– surprises cause implementation headache

KEY: Connection to pipelining
Control flow instructions affect which instruction is fetched next
Fetching occurs at the frontend of the pipeline
• fetch in frontend, and process in the backend
Register file is in the backend of the pipeline
If pipeline is deep (done to make each stage small and fast, and
hence clock speed higher, so there are many stages)
• frontend and backend far away from each other (in time)
• => if processing branch needs info from backend that will
be slow
Save or Restore State?
Many recent architectures do no register saving
Or do implicit register saving with register windows (SPARC)

Notation
Generic assembly code
• sub r1, r2, r3
• means r1 := r2 - r3
Data sizes
• byte 8 bits
• halfword 16 bits
• word 32 bits
• doubleword 64 bits
• quad word 128 bits
VAX
DEC 1977 VAX-11/780

VAX
Data types
VAX
Addressing modes
• literal 6 bits
• 8, 16, 32 bit immediates
• register, register deferred
• 8, 16, 32 bit displacements
• 8, 16, 32 bit displacements deferred
• indexed (scaled)
• autoincrement, autodecrement
• autoincrement deferred

VAX
Operations
• data transfer including string move
• arithmetic and logical (2 and 3 operands)
• control (branch, jump, etc)
• AOBLEQ
• function calls save state
• bit manipulation
• floating point - add, sub, mul, div, polyf
• system - exception, VM
• other - crc (cyclic redundancy check), insque (insert in Q)
VAX
addl3 R1,737(R2),#456
byte 1: addl3
byte 2: mode, R1
byte 3: mode, R2
byte 4-5: 737
byte 6: mode
byte 7-10: 456
VAX has too many modes and formats
The big deal with RISC is not fewer instrs
• Fewer modes/formats => faster decoding in pipelining

8086
Intel in 1978
• chosen for IBM PC 1980
• remains most popular 16-bit architecture
• upward compatible with 8080
• complex - “difficult to explain and impossible to love”
• special purpose registers
• 4 arithmetic, 4 address, 4 segment, 2 control
• addresses - 16 bit segment<<4 + 16 bit offset
• 64K 16B-aligned 64KB segments
• many formats - see H&P
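
The 8086 address formation (16-bit segment shifted left by 4, plus a 16-bit offset) can be sketched as below. The function name is hypothetical; the 20-bit mask reflects the 8086's 20 address lines.

```python
def physical_address(segment: int, offset: int) -> int:
    """8086 real-mode address: (segment << 4) + offset, kept to 20 bits."""
    segment &= 0xFFFF
    offset &= 0xFFFF
    return ((segment << 4) + offset) & 0xFFFFF

addr_a = physical_address(0x1234, 0x0010)
addr_b = physical_address(0x1235, 0x0000)   # a different pair, same byte
wrapped = physical_address(0xFFFF, 0x0010)  # runs past 1 MB and wraps
```

Since segments start every 16 bytes but span 64KB, they overlap heavily, and many segment:offset pairs name the same physical byte.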
MIPS
• RISC
• 32-bit byte addresses aligned
• load/store - only displacement addressing
• standard datatypes
• 3 fixed length formats
• 32 32-bit GPRs (r0 = 0)
• 16 64-bit (32 32-bit) FPRs
• FP status register
• no CCs

MIPS
Data transfer
• load/store word, load/store byte/halfword signed?
• load/store FP single/double
• moves between GPRs and FPRs
ALU
• add/subtract signed? immediate?
• multiply/divide signed?
• and,or,xor immediate?, shifts: ll, rl, ra immediate?
• sets immediate?
MIPS
Control
• branches == 0, <> 0
• conditional branch testing FP bit
• jump, jump register
• jump & link, jump & link register
• trap, return-from-exception
FP
• add/sub/mul/div single/double
• fp converts, fp set

MIPS
Instruction formats (field widths in bits)
         6       5     5     16
I-type:  Opcode  rs1   rd    Immediate
         6       5     5     5     11
R-type:  Opcode  rs1   rs2   rd    func
         6       26
J-type:  Opcode  Offset added to PC
I format - ALU immediate, loads/stores, branches, jump register
R format - RRR ALU ops
J format - unconditional jumps (& link?)
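
A sketch of packing and unpacking the I-type format with the 6/5/5/16 field widths given in these notes. The opcode value used below is a placeholder for illustration only, not a claim about the real MIPS opcode map, and the function names are hypothetical.

```python
def encode_i_type(opcode: int, rs1: int, rd: int, imm: int) -> int:
    """Pack Opcode(6) | rs1(5) | rd(5) | Immediate(16) into a 32-bit word."""
    return ((opcode & 0x3F) << 26) | ((rs1 & 0x1F) << 21) \
        | ((rd & 0x1F) << 16) | (imm & 0xFFFF)

def decode_i_type(word: int):
    """Unpack the four fields; each sits at a fixed bit position."""
    opcode = (word >> 26) & 0x3F
    rs1 = (word >> 21) & 0x1F
    rd = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    return opcode, rs1, rd, imm

PLACEHOLDER_OPCODE = 0x08  # assumption: arbitrary 6-bit value for the demo
word = encode_i_type(PLACEHOLDER_OPCODE, rs1=2, rd=1, imm=-4)
fields = decode_i_type(word)
```

Decoding needs only shifts and masks at fixed positions, so the register specifiers can be read before the opcode is fully decoded; this is the decoding advantage of having few fixed formats.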
Compilers 101 (Wulf’s paper)
Wm. Wulf’s “compilers and architecture”
Compiler goals:
• all correct programs execute correctly
• most compiled programs execute fast (optimizations)
• fast compilation
• debugging support

Compilers 101
Phases to manage complexity
Parsing --> intermediate representation
Procedure inlining
Loop Optimizations
Common Sub-Expression
Jump Optimization
Constant Propagation
Register Allocation
Strength Reduction
Pipeline Scheduling
Code Generation --> assembly code
Compilers 101
one solution or all possible solutions
• 2 branch conditions - eq, lt
• or all six - eq, ne, lt, gt, le, ge
• not 3 or 4
Primitives NOT solutions

RISC vs. CISC (Clark & Bhandarkar)
Clark & Bhandarkar ASPLOS paper: VAX 8700 vs MIPS R3000
Combines 3 features
• architecture
• implementation
• compilers and OS
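
To illustrate the "2 branch conditions - eq, lt" option from the Compilers 101 notes, the sketch below derives the other four comparisons from those two primitives by swapping operands and negating. This is one way a compiler could synthesize them, offered as an illustration rather than as Wulf's own construction.

```python
def eq(a, b):
    """Primitive: equal."""
    return a == b

def lt(a, b):
    """Primitive: less than."""
    return a < b

# Derived conditions, built only from the two primitives.
def ne(a, b):
    return not eq(a, b)

def gt(a, b):
    return lt(b, a)        # swap operands

def le(a, b):
    return not lt(b, a)    # not greater-than

def ge(a, b):
    return not lt(a, b)    # not less-than

pairs = [(1, 2), (2, 1), (3, 3), (-5, 0)]
```

With either two or all six conditions the compiler has a single regular rule to apply; a set of three or four forces special cases.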
RISC factor = (CPI_VAX * Instr_VAX) / (CPI_MIPS * Instr_MIPS)

Benchmark   instruction   CPI    CPI    CPI     RISC
            ratio         MIPS   VAX    ratio   factor
li          1.6           1.1    6.5    6.0     3.7
eqntott     1.1           1.3    4.4    3.5     3.3
tomcatv     2.9           2.1    17.5   8.2     2.9

• increase VAX CPI but decrease VAX instruction count
• increase MIPS instruction count
• e.g. 1: loads/stores vs. operand specifiers
• e.g. 2: necessary complex instructions: loop branches
Factors favoring VAX
• not-taken branches incur no delay
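
The RISC factor can be recomputed from the table columns. The sketch assumes the "instruction ratio" column means Instr_MIPS / Instr_VAX, so that the factor equals the CPI ratio divided by the instruction ratio; the function name is hypothetical. Because the table entries are rounded, recomputed values may differ slightly from the printed ones.

```python
def risc_factor(instr_ratio: float, cpi_mips: float, cpi_vax: float) -> float:
    """(CPI_VAX * Instr_VAX) / (CPI_MIPS * Instr_MIPS).

    instr_ratio is assumed to be Instr_MIPS / Instr_VAX.
    """
    cpi_ratio = cpi_vax / cpi_mips
    return cpi_ratio / instr_ratio

li_factor = risc_factor(instr_ratio=1.6, cpi_mips=1.1, cpi_vax=6.5)
tomcatv_factor = risc_factor(instr_ratio=2.9, cpi_mips=2.1, cpi_vax=17.5)
```

The point of the metric is visible in the arithmetic: MIPS executes more instructions, but its CPI advantage is larger, so the net factor stays well above 1.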
RISC vs. CISC
Factors favoring MIPS
• operand specifier decoding
• number of registers
• separate floating point unit
• simple branches/jumps (lower latency)
• no complex instructions
• instruction scheduling
• translation buffer
• branch displacement size

Technology Scaling (Borkar’s paper)
Why is technology scaling so important?
What are the goals of scaling? (keep in mind that 0.7 = 1/sqrt(2))
• reduce gate delay by 30% => clock up by 43% (1/0.7)
• double transistor density
• reduce energy per switch by 65% => reduce power by 50%
Scaling theory:
• delay = 0.7 => frequency = 1.43
• width, length, thickness = 0.7 => area cap, fringe cap = 0.7
• total cap = 0.7, total area = 0.7^2 = 0.5
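
The scaling goals can be checked with a few lines of arithmetic. The sketch assumes the usual relations: switching energy proportional to C * Vdd^2, power equal to energy times frequency, and Vdd scaling by the same 0.7 factor as the dimensions. The variable names are illustrative.

```python
S = 0.7  # per-generation scaling factor, roughly 1/sqrt(2)

delay = S                      # gate delay shrinks by 30%
frequency = 1 / delay          # clock goes up by about 43%

area = S * S                   # each dimension scales by 0.7
density = 1 / area             # about twice the transistors per unit area

capacitance = S                # total capacitance scales by 0.7
vdd = S                        # assumption: supply voltage scales too
energy_per_switch = capacitance * vdd ** 2   # about 65% lower
power = energy_per_switch * frequency        # about 50% lower
```

Each goal on the slide falls out of the single factor 0.7: 1/0.7 for frequency, 0.7^2 for area, 0.7^3 for energy, and 0.7^2 for power.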
Projections
Power will increase to 10KW if Vdd not scaled
• if Vdd is scaled it will be 2KW in 2010
• if die size is restricted it will be 100W for small die, 200W
for large die
Energy-delay trade-off
• why scale only delay? why not both energy and delay?
• then energy-delay is better metric than delay alone
• if Vdd is not scaled energy-delay reduces by 50%
• if Vdd is scaled then energy-delay reduces by 75%

Projections
If Vdd is scaled then threshold voltage Vt must scale (take
ECE559 if you do not understand this)
• but lower Vt => exponentially more leakage energy
• i.e., Vdd scaling reduces dynamic energy but
increases leakage energy!
• leakage increases roughly 5x every generation
• today leakage is 15% soon it will equal dynamic power
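
The two energy-delay figures (50% and 75%) can be reproduced under simple assumptions: dynamic energy proportional to C * Vdd^2, capacitance and delay each scaling by 0.7 per generation, and leakage ignored. The function name and the choice to keep delay at 0.7 in both cases are assumptions of this sketch.

```python
def energy_delay_scale(scale_vdd: bool, s: float = 0.7) -> float:
    """Per-generation scaling of the energy-delay product (dynamic only)."""
    capacitance = s
    vdd = s if scale_vdd else 1.0
    energy = capacitance * vdd ** 2
    delay = s
    return energy * delay

without_vdd_scaling = energy_delay_scale(scale_vdd=False)  # about 0.49
with_vdd_scaling = energy_delay_scale(scale_vdd=True)      # about 0.24
```

The extra factor of 0.7^2 from Vdd is what moves the improvement from roughly half to roughly three quarters, before leakage is taken into account.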
Technology
• power - dynamic and static
• leakage, power, power density, soft errors, noise ...
I have research projects on all these topics (not a coincidence!)
Challenges for Next Decade
Optimal power-performance pipelines are shallower than that for
pure performance
• because twice the speed comes at eight times power
• pipeline depth fundamentally determines power
• deep pipeline + lower Vdd for power => poorer performance

Challenges for Next Decade
Options
• Chip multiprocessors (CMPs) alleviate power problem
• why will this reduce power?
• but software has to be able to take advantage of
CMPs
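
One common way to read "twice the speed comes at eight times power" is a cubic model: dynamic power is proportional to C * Vdd^2 * f, and if Vdd must rise in proportion to f, power grows as f^3. The sketch below applies that assumed model to compare one fast core with two slower ones; it ignores leakage and assumes perfectly parallel software, so treat it as a rough illustration of the CMP argument rather than a result from the notes.

```python
def relative_power(freq_scale: float, cores: int = 1) -> float:
    """Relative dynamic power, assuming Vdd scales with frequency (P ~ f^3)."""
    return cores * freq_scale ** 3

def relative_throughput(freq_scale: float, cores: int = 1) -> float:
    """Relative throughput, assuming the work parallelizes perfectly."""
    return cores * freq_scale

one_fast_core = relative_power(2.0)             # 2x speed from clocking
two_slow_cores = relative_power(1.0, cores=2)   # 2x speed from parallelism
```

Under these assumptions both designs double throughput, but the two-core design does so at 2x power instead of 8x, provided the software can actually use both cores.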