VECTOR COMPUTERS
Noor Mahammad Sk
Supercomputer
Definition of a supercomputer:
Fastest machine in the world at a given task
A device to turn a compute‐bound problem into an I/O‐bound problem
CDC 6600 (Cray, 1964) is regarded as the first supercomputer
Supercomputer Applications
Typical application areas
Military research (nuclear weapons, cryptography)
Scientific research
Weather forecasting
Oil exploration
Industrial design (car crash simulation)
All involve huge computations on large data sets
In the 70s–80s, Supercomputer ≡ Vector Machine
Vector Supercomputers
Scalar Unit + Vector Extensions
Load/Store Architecture
Vector Registers
Vector Instructions
Hardwired Control
Highly Pipelined Functional Units
Interleaved Memory System
No Data Caches
No Virtual Memory
Cray‐1 (1976)
[Cray‐1 block diagram: eight 64‐element vector registers V0–V7, with vector mask (V. Mask) and vector length (V. Length) registers; scalar registers S0–S7 backed by 64 T registers; address registers A0–A7 backed by 64 B registers; pipelined functional units for FP add, FP multiply, FP reciprocal, integer add, integer logic, integer shift, population count, address add, and address multiply; single‐port memory of 16 banks of 64‐bit words with 8‐bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction buffer refill into 4 instruction buffers (64 bits × 16); memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz).]
Vector Programming Model
[Figure: the vector programming model. Scalar registers (r0, …) sit alongside vector registers (v0, …), each holding elements [0], [1], …, [VLRMAX−1]; the Vector Length Register (VLR) gives the number of active elements. Vector arithmetic instructions, e.g. ADDV v3, v1, v2, apply the operation element‐wise over positions [0] … [VLR−1]. Vector load/store instructions address memory with a base (r1) and a stride (r2).]
Vector Code Example
V3 <- v1 * v2
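A C sketch of what this computes (function and variable names are illustrative, with MVL = 64 as on the Cray‐1); a vectorizing compiler strip‐mines the loop and maps each strip to LV/LV/MULV/SV with VLR set to the strip length:

/* Element-wise multiply c = a * b, strip-mined to a maximum
   vector length of 64 elements (illustrative names). */
#define MVL 64

void vmul64(double *c, const double *a, const double *b, int n) {
    for (int i = 0; i < n; i += MVL) {
        int vl = (n - i < MVL) ? (n - i) : MVL; /* set VLR for this strip */
        for (int j = 0; j < vl; j++)            /* LV v1; LV v2;          */
            c[i + j] = a[i + j] * b[i + j];     /* MULV v3,v1,v2; SV v3   */
    }
}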
Vector Memory System
Cray‐1, 16 banks, 4 cycle bank busy time, 12 cycle latency
Bank busy time: Cycles between accesses to same bank
[Figure: an address generator takes a base and a stride and issues one element address per cycle to 16 interleaved memory banks (0–F), which return data to the vector registers.]
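A rough C model (not Cray hardware code; bank count and busy time taken from the slide) of why stride matters: low‐order interleaving sends word address a to bank a mod 16, so a stride that revisits a bank within its 4‐cycle busy time conflicts. This simplified model just counts accesses that would find their bank still busy, without modeling the knock‐on delay:

#include <stdio.h>

#define NBANKS 16
#define BANK_BUSY 4   /* cycles a bank stays busy after an access (Cray-1: 4) */

int main(void) {
    for (int stride = 1; stride <= 8; stride *= 2) {
        int busy_until[NBANKS] = {0}, stalls = 0;
        /* issue one element address per cycle */
        for (int t = 0, addr = 0; t < 64; t++, addr += stride) {
            int bank = addr % NBANKS;        /* low-order interleaving */
            if (busy_until[bank] > t) stalls++;
            busy_until[bank] = t + BANK_BUSY;
        }
        printf("stride %d: %d conflicting accesses\n", stride, stalls);
    }
    return 0;
}

With 16 banks and a 4‐cycle busy time, strides 1, 2, and 4 run conflict‐free, while stride 8 revisits a bank every 2 cycles and stalls.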
Vector Instruction Execution
ADDV C,A,B
[Figure: execution of ADDV C,A,B on a machine with four lanes. Elements are interleaved across the lanes — lane 0 handles elements 0, 4, 8, …; lane 1 handles 1, 5, 9, …; lane 2 handles 2, 6, 10, …; lane 3 handles 3, 7, 11, … Each lane has its own slice of the vector registers, its own adder pipeline, and its own port to the memory subsystem.]
Vector Memory‐Memory versus Vector Register Machines
Vector memory‐memory instructions hold all vector operands in main memory
The first vector machines, CDC Star‐100 (‘73) and TI ASC (‘71), were memory‐memory machines
Cray‐1 (’76) was first vector register machine
Vector Memory‐Memory Code Example

Source Code:
for (i=0; i<N; i++) {
  C[i] = A[i] + B[i];
  D[i] = A[i] - B[i];
}

Vector Memory‐Memory Code:
ADDV C, A, B
SUBV D, A, B

Vector Register Code:
LV V1, A
LV V2, B
ADDV V3, V1, V2
SV V3, C
SUBV V4, V1, V2
SV V4, D
Vector Memory‐Memory vs. Vector Register Machines
Vector memory‐memory architectures (VMMA) require
greater main memory bandwidth, why?
All operands must be read in and out of memory
VMMAs make it difficult to overlap execution of multiple vector operations, why?
Must check dependencies on memory addresses
VMMAs incur greater startup latency
Scalar code was faster on the CDC Star‐100 for vectors < 100 elements
For the Cray‐1, the vector/scalar breakeven point was around 2 elements
Automatic Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] + B[i];
[Figure: scalar execution issues iteration 1’s load, load, add, store, then iteration 2’s, one element at a time; after vectorization, each operation becomes a single vector instruction whose issue covers the corresponding element of every iteration.]
Vector Chaining
[Figure: LV v1 followed by MULV v3,v1,v2 over vector registers v1–v5; each element loaded into v1 is forwarded straight into the multiply pipeline.]
With chaining, a dependent instruction can start as soon as the first result of the previous instruction appears
[Figure: with chaining, the Load, Mul, and Add pipelines overlap in time instead of running back‐to‐back.]
Vector Startup
Two components of vector startup penalty
functional unit latency (time through pipeline)
dead time or recovery time (time before another vector instruction can start down pipeline)
[Figure: elements of two vector instructions flowing through an R, X, X, X, W pipeline. Functional unit latency is the time for one element to pass through the pipeline; dead time is the recovery gap between the last element of the first vector instruction and the first element of the second vector instruction entering the pipeline.]
Dead Time and Short Vectors
[Figure: with no dead time, a 64‐element vector keeps the pipeline busy for all 64 active cycles; any dead time between back‐to‐back vector instructions idles the pipeline, and the shorter the vector, the larger the fraction of time lost — e.g., 4 dead cycles per 64 active cycles caps utilization at 64/68 ≈ 94%.]
Masked Vector Instructions
[Figure: computing C = A + B under mask M[7]…M[0]. A simple implementation executes all elements and uses each mask bit as a write enable on the destination’s write data port. A density‐time implementation scans the mask vector and spends execution cycles only on elements whose mask bit is 1.]
Compress/Expand Operations
Compress packs non‐masked elements from one vector register contiguously at the start of the destination vector register
population count of mask vector gives packed vector length
Expand performs inverse operation
[Figure: with mask M = 1,0,1,1,0,0,1,0 for elements 7…0, compress packs A[7], A[5], A[4], A[1] into the low elements of the destination; expand performs the inverse, scattering the packed values back to positions 7, 5, 4, 1 and leaving the masked‐off positions (holding B values here) untouched.]
Used for density-time conditionals and also for general selection operations
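Reference semantics in C (a sketch; function names illustrative) for the two operations:

/* Compress: pack elements of a whose mask bit is set contiguously
   into dst; returns the packed length (popcount of the mask). */
int compress(double *dst, const double *a, const int *m, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (m[i]) dst[k++] = a[i];       /* pack masked elements */
    return k;
}

/* Expand: inverse operation — scatter packed values back to the
   masked positions, leaving other positions unchanged. */
void expand(double *dst, const double *packed, const int *m, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (m[i]) dst[i] = packed[k++];
}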
A Modern Vector Super: NEC SX‐6 (2003)
CMOS Technology
500 MHz CPU, fits on single chip
SDRAM main memory (up to 64GB)
Scalar unit
4‐way superscalar with out‐of‐order and speculative execution
64KB I‐cache and 64KB data cache
Vector unit
8 foreground VRegs + 64 background VRegs (256x64‐bit elements/VReg)
1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit
8 lanes (8 GFLOPS peak, 16 FLOPS/cycle)
1 load & store unit (32x8 byte accesses/cycle)
32 GB/s memory bandwidth per processor
SMP structure
8 CPUs connected to memory through crossbar
256 GB/s shared memory bandwidth (4096 interleaved banks)
Multimedia Extensions
Very short vectors added to existing ISAs for micros
Usually 64‐bit registers split into 2x32b or 4x16b or 8x8b
Newer designs have 128‐bit registers (Altivec, SSE2)
Limited instruction set:
no vector length control
no strided load/store or scatter/gather
unit‐stride loads must be aligned to 64/128‐bit boundary
Limited vector register length:
requires superscalar dispatch to keep multiply/add/load units busy
loop unrolling to hide latencies increases register pressure
Trend towards fuller vector support in microprocessors
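For concreteness, a short SSE2 sketch (the intrinsics are real; the surrounding code is illustrative): one 128‐bit instruction adds eight 16‐bit lanes, and because there is no vector length control, n is assumed here to be a multiple of 8 — real code needs a scalar cleanup loop:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two arrays of 16-bit integers, 8 lanes per instruction.
   Assumes n is a multiple of 8 (no vector length control). */
void add16(short *c, const short *a, const short *b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi16(va, vb));
    }
}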
Graphical Processing Units
Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Basic idea:
Heterogeneous execution model
CPU is the host, GPU is the device
Develop a C‐like programming language for GPU
Unify all forms of GPU parallelism as CUDA thread
Programming model is “Single Instruction Multiple Thread”
Threads and Blocks
A thread is associated with each data element
Threads are organized into blocks
Blocks are organized into a grid
GPU hardware handles thread management, not applications or OS
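A minimal CUDA sketch of the hierarchy (names illustrative): one thread per element, threads grouped into blocks, blocks forming the grid; the hardware, not the program, schedules the blocks:

__global__ void vadd(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (i < n) c[i] = a[i] + b[i];
}

/* Host-side launch: a grid of blocks, each block a group of threads. */
void launch(float *c, const float *a, const float *b, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vadd<<<blocks, threadsPerBlock>>>(c, a, b, n);
}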
NVIDIA GPU Architecture
Similarities to vector machines:
Works well with data‐level parallel problems
Scatter‐gather transfers
Mask registers
Large register files
Differences:
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Example
Multiply two vectors of length 8192
Code that works over all elements is the grid
Thread blocks break this down into manageable sizes
512 threads per block
SIMD instruction executes 32 elements at a time
Thus grid size = 16 blocks
Block is analogous to a strip‐mined vector loop with vector length of 32
Block is assigned to a multithreaded SIMD processor by the thread block scheduler
Current‐generation GPUs (Fermi) have 7–15 multithreaded SIMD processors
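The slide’s numbers written out as a CUDA launch (a sketch; the kernel name is illustrative):

__global__ void vmul(double *c, const double *a, const double *b) {
    /* No bounds check needed here: 8192 divides evenly into the grid. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] * b[i];
}

void launch8192(double *c, const double *a, const double *b) {
    /* 8192 elements / 512 threads per block = 16 blocks in the grid.
       Each 512-thread block is 512/32 = 16 SIMD threads (warps),
       i.e. a strip-mined loop with vector length 32. */
    vmul<<<16, 512>>>(c, a, b);
}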
Terminology
Threads of SIMD instructions
Each has its own PC
Thread scheduler uses scoreboard to dispatch
No data dependencies between threads!
Keeps track of up to 48 threads of SIMD instructions
Hides memory latency
Thread block scheduler schedules blocks to SIMD processors
Within each SIMD processor:
32 SIMD lanes
Wide and shallow compared to vector processors
Example
NVIDIA GPU has 32,768 registers
Divided into lanes
Each SIMD thread is limited to 64 registers
SIMD thread has up to:
64 vector registers of 32 32‐bit elements
32 vector registers of 32 64‐bit elements
Fermi has 16 physical SIMD lanes, each containing 2048 registers (16 × 2048 = 32,768)
NVIDIA Instruction Set Arch.
ISA is an abstraction of the hardware instruction set
“Parallel Thread Execution (PTX)”
Uses virtual registers
Translation to machine code is performed in software
Example:
shl.s32 R8, blockIdx, 9        ; Thread Block ID * Block size (512 or 2^9)
add.s32 R8, R8, threadIdx      ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]      ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]      ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4          ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2          ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0      ; Y[i] = sum (X[i]*a + Y[i])
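CUDA source along these lines (a sketch; 512‐thread blocks assumed, matching the shift by 9 above) would compile to PTX like the above:

__global__ void daxpy(int n, double a, const double *X, double *Y) {
    /* shl.s32/add.s32 above: i = blockIdx * 512 + threadIdx */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        Y[i] = a * X[i] + Y[i];   /* the ld/mul/add/st sequence above */
}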
Conditional Branching
Like vector architectures, GPU branch hardware uses internal masks
Also uses
Branch synchronization stack
Entries consist of masks for each SIMD lane
I.e. which threads commit their results (all threads execute)
Instruction markers to manage when a branch diverges into multiple execution paths
Push on divergent branch
…and when paths converge
Act as barriers
Pops stack
Per‐thread‐lane 1‐bit predicate register, specified by programmer
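A CUDA sketch of a divergent branch (illustrative, in the style of the textbook example): while the if path runs, lanes with X[i] == 0 are masked off; the mask is inverted for the else path; the two paths reconverge where the stack entry is popped:

__global__ void cond(double *X, const double *Y, const double *Z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (X[i] != 0.0)
            X[i] = X[i] - Y[i];   /* taken path: other lanes masked off */
        else
            X[i] = Z[i];          /* mask inverted for the else path    */
        /* all lanes of the SIMD thread reconverge here */
    }
}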
NVIDIA GPU Memory Structures
Each SIMD Lane has private section of off‐chip DRAM
“Private memory”
Contains stack frame, spilling registers, and private variables
Each multithreaded SIMD processor also has local memory
Shared by SIMD lanes / threads within a block
Memory shared by SIMD processors is GPU Memory
Host can read and write GPU memory
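In CUDA terms (a sketch; names illustrative), the three levels map onto language qualifiers: per‐thread locals live in registers and spill to private memory, __shared__ arrays live in the SIMD processor’s local memory, and __device__ data lives in GPU memory that the host can access:

__device__ float gpu_table[1024];     /* GPU memory: visible to all blocks,
                                         accessible from the host via
                                         cudaMemcpyToSymbol and friends   */

__global__ void levels(float *out) {  /* launch with 256 threads per block */
    __shared__ float tile[256];       /* local memory: shared by the lanes/
                                         threads of one block             */
    float t = 2.0f * threadIdx.x;     /* private: per-thread value; spills
                                         to private memory if needed      */
    tile[threadIdx.x] = t + gpu_table[threadIdx.x];
    __syncthreads();                  /* threads of the block synchronize */
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}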
Fermi Architecture Innovations
Each SIMD processor has
Two SIMD thread schedulers, two instruction dispatch units
16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load‐store units, 4 special function units
Thus, two threads of SIMD instructions are scheduled every two clock cycles
Fast double precision
Caches for GPU memory
64‐bit addressing and unified address space
Error correcting codes
Faster context switching
Faster atomic instructions
Fermi Multithreaded SIMD Proc.