VECTOR COMPUTERS
Noor Mahammad Sk
Supercomputer
Definition of a supercomputer:
Fastest machine in the world at a given task
A device to turn a compute‐bound problem into an I/O‐bound problem
CDC 6600 (Cray, 1964) is regarded as the first supercomputer
Supercomputer Applications
Typical application areas
Military research (nuclear weapons, cryptography)
Scientific research
Weather forecasting
Oil exploration
Industrial design (car crash simulation)
All involve huge computations on large data sets
In the 70s–80s, Supercomputer ≡ Vector Machine
Vector Supercomputers
Scalar Unit + Vector Extensions
Load/Store Architecture
Vector Registers
Vector Instructions
Hardwired Control
Highly Pipelined Functional Units
Interleaved Memory System
No Data Caches
No Virtual Memory
Cray‐1 (1976)
[Cray‐1 block diagram: eight 64‐element vector registers V0–V7, with vector mask (V. Mask) and vector length (V. Length) registers; scalar registers S0–S7 backed by 64 T registers; address registers A0–A7 backed by 64 B registers; pipelined functional units for FP add, FP multiply, FP reciprocal, integer add, integer logic, integer shift, population count, address add, and address multiply; single‐port memory of 16 banks of 64‐bit words with 8‐bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction buffer refill into 4 instruction buffers (64 bits × 16); memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz).]
Vector Programming Model
[Figure: the vector programming model. Scalar registers (r0, …) sit alongside vector registers (v0, …), each holding elements [0], [1], …, [VLRMAX−1]; the Vector Length Register (VLR) gives the number of active elements. Vector arithmetic instructions, e.g. ADDV v3, v1, v2, apply the operation element‐wise over positions [0] … [VLR−1]. Vector load/store instructions address memory with a base (r1) and a stride (r2).]
Vector Code Example
V3 <- v1 * v2
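A C sketch of what this computes (function and variable names are illustrative, with MVL = 64 as on the Cray‐1); a vectorizing compiler strip‐mines the loop and maps each strip to LV/LV/MULV/SV with VLR set to the strip length:

/* Element-wise multiply c = a * b, strip-mined to a maximum
   vector length of 64 elements (illustrative names). */
#define MVL 64

void vmul64(double *c, const double *a, const double *b, int n) {
    for (int i = 0; i < n; i += MVL) {
        int vl = (n - i < MVL) ? (n - i) : MVL; /* set VLR for this strip */
        for (int j = 0; j < vl; j++)            /* LV v1; LV v2;          */
            c[i + j] = a[i + j] * b[i + j];     /* MULV v3,v1,v2; SV v3   */
    }
}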
Vector Memory System
Cray‐1, 16 banks, 4 cycle bank busy time, 12 cycle latency
Bank busy time: Cycles between accesses to same bank
[Figure: an address generator takes a base and a stride and issues one element address per cycle to 16 interleaved memory banks (0–F), which return data to the vector registers.]
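A rough C model (not Cray hardware code; bank count and busy time taken from the slide) of why stride matters: low‐order interleaving sends word address a to bank a mod 16, so a stride that revisits a bank within its 4‐cycle busy time conflicts. This simplified model just counts accesses that would find their bank still busy, without modeling the knock‐on delay:

#include <stdio.h>

#define NBANKS 16
#define BANK_BUSY 4   /* cycles a bank stays busy after an access (Cray-1: 4) */

int main(void) {
    for (int stride = 1; stride <= 8; stride *= 2) {
        int busy_until[NBANKS] = {0}, stalls = 0;
        /* issue one element address per cycle */
        for (int t = 0, addr = 0; t < 64; t++, addr += stride) {
            int bank = addr % NBANKS;        /* low-order interleaving */
            if (busy_until[bank] > t) stalls++;
            busy_until[bank] = t + BANK_BUSY;
        }
        printf("stride %d: %d conflicting accesses\n", stride, stalls);
    }
    return 0;
}

With 16 banks and a 4‐cycle busy time, strides 1, 2, and 4 run conflict‐free, while stride 8 revisits a bank every 2 cycles and stalls.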
Vector Instruction Execution
ADDV C,A,B
[Figure: execution of ADDV C,A,B on a machine with four lanes. Elements are interleaved across the lanes — lane 0 handles elements 0, 4, 8, …; lane 1 handles 1, 5, 9, …; lane 2 handles 2, 6, 10, …; lane 3 handles 3, 7, 11, … Each lane has its own slice of the vector registers, its own adder pipeline, and its own port to the memory subsystem.]
Vector Memory‐Memory versus Vector Register Machines
Vector memory‐memory instructions hold all vector operands in main memory
The first vector machines, CDC Star‐100 (‘73) and TI ASC (‘71), were memory‐memory machines
Cray‐1 (’76) was first vector register machine
Vector Memory‐Memory Code Example

Source Code:
for (i=0; i<N; i++) {
  C[i] = A[i] + B[i];
  D[i] = A[i] - B[i];
}

Vector Memory‐Memory Code:
ADDV C, A, B
SUBV D, A, B

Vector Register Code:
LV V1, A
LV V2, B
ADDV V3, V1, V2
SV V3, C
SUBV V4, V1, V2
SV V4, D
Vector Memory‐Memory vs. Vector Register Machines
Vector memory‐memory architectures (VMMA) require
greater main memory bandwidth, why?
All operands must be read in and out of memory
VMMAs make it difficult to overlap execution of multiple vector operations, why?
Must check dependencies on memory addresses
VMMAs incur greater startup latency
Scalar code was faster on the CDC Star‐100 for vectors < 100 elements
For the Cray‐1, the vector/scalar breakeven point was around 2 elements
Automatic Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] + B[i];
[Figure: scalar execution issues iteration 1’s load, load, add, store, then iteration 2’s, one element at a time; after vectorization, each operation becomes a single vector instruction whose issue covers the corresponding element of every iteration.]
Vector Chaining
[Figure: LV v1 followed by MULV v3,v1,v2 over vector registers v1–v5; each element loaded into v1 is forwarded straight into the multiply pipeline.]
With chaining, a dependent instruction can start as soon as the first result of the previous instruction appears
[Figure: with chaining, the Load, Mul, and Add pipelines overlap in time instead of running back‐to‐back.]
Vector Startup
Two components of vector startup penalty
functional unit latency (time through pipeline)
dead time or recovery time (time before another vector instruction can start down pipeline)
[Figure: elements of two vector instructions flowing through an R, X, X, X, W pipeline. Functional unit latency is the time for one element to pass through the pipeline; dead time is the recovery gap between the last element of the first vector instruction and the first element of the second vector instruction entering the pipeline.]
Dead Time and Short Vectors
[Figure: with no dead time, a 64‐element vector keeps the pipeline busy for all 64 active cycles; any dead time between back‐to‐back vector instructions idles the pipeline, and the shorter the vector, the larger the fraction of time lost — e.g., 4 dead cycles per 64 active cycles caps utilization at 64/68 ≈ 94%.]
Masked Vector Instructions
[Figure: computing C = A + B under mask M[7]…M[0]. A simple implementation executes all elements and uses each mask bit as a write enable on the destination’s write data port. A density‐time implementation scans the mask vector and spends execution cycles only on elements whose mask bit is 1.]
Compress/Expand Operations
Compress packs non‐masked elements from one vector register contiguously at the start of the destination vector register
population count of mask vector gives packed vector length
Expand performs inverse operation
[Figure: with mask M = 1,0,1,1,0,0,1,0 for elements 7…0, compress packs A[7], A[5], A[4], A[1] into the low elements of the destination; expand performs the inverse, scattering the packed values back to positions 7, 5, 4, 1 and leaving the masked‐off positions (holding B values here) untouched.]
Used for density-time conditionals and also for general selection operations
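Reference semantics in C (a sketch; function names illustrative) for the two operations:

/* Compress: pack elements of a whose mask bit is set contiguously
   into dst; returns the packed length (popcount of the mask). */
int compress(double *dst, const double *a, const int *m, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (m[i]) dst[k++] = a[i];       /* pack masked elements */
    return k;
}

/* Expand: inverse operation — scatter packed values back to the
   masked positions, leaving other positions unchanged. */
void expand(double *dst, const double *packed, const int *m, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (m[i]) dst[i] = packed[k++];
}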
A Modern Vector Super: NEC SX‐6 (2003)
CMOS Technology
500 MHz CPU, fits on single chip
SDRAM main memory (up to 64GB)
Scalar unit
4‐way superscalar with out‐of‐order and speculative execution
64KB I‐cache and 64KB data cache
Vector unit
8 foreground VRegs + 64 background VRegs (256x64‐bit elements/VReg)
1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit
8 lanes (8 GFLOPS peak, 16 FLOPS/cycle)
1 load & store unit (32x8 byte accesses/cycle)
32 GB/s memory bandwidth per processor
SMP structure
8 CPUs connected to memory through crossbar
256 GB/s shared memory bandwidth (4096 interleaved banks)
Multimedia Extensions
Very short vectors added to existing ISAs for micros
Usually 64‐bit registers split into 2x32b or 4x16b or 8x8b
Newer designs have 128‐bit registers (Altivec, SSE2)
Limited instruction set:
no vector length control
no strided load/store or scatter/gather
unit‐stride loads must be aligned to 64/128‐bit boundary
Limited vector register length:
requires superscalar dispatch to keep multiply/add/load units busy
loop unrolling to hide latencies increases register pressure
Trend towards fuller vector support in microprocessors
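For concreteness, a short SSE2 sketch (the intrinsics are real; the surrounding code is illustrative): one 128‐bit instruction adds eight 16‐bit lanes, and because there is no vector length control, n is assumed here to be a multiple of 8 — real code needs a scalar cleanup loop:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two arrays of 16-bit integers, 8 lanes per instruction.
   Assumes n is a multiple of 8 (no vector length control). */
void add16(short *c, const short *a, const short *b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi16(va, vb));
    }
}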
Graphical Processing Units
Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Basic idea:
Heterogeneous execution model
CPU is the host, GPU is the device
Develop a C‐like programming language for GPU
Unify all forms of GPU parallelism as CUDA thread
Programming model is “Single Instruction Multiple Thread”
Threads and Blocks
A thread is associated with each data element
Threads are organized into blocks
Blocks are organized into a grid
GPU hardware handles thread management, not applications or OS
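A minimal CUDA sketch of the hierarchy (names illustrative): one thread per element, threads grouped into blocks, blocks forming the grid; the hardware, not the program, schedules the blocks:

__global__ void vadd(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (i < n) c[i] = a[i] + b[i];
}

/* Host-side launch: a grid of blocks, each block a group of threads. */
void launch(float *c, const float *a, const float *b, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vadd<<<blocks, threadsPerBlock>>>(c, a, b, n);
}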
NVIDIA GPU Architecture
Similarities to vector machines:
Works well with data‐level parallel problems
Scatter‐gather transfers
Mask registers
Large register files
Differences:
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Example
Multiply two vectors of length 8192
Code that works over all elements is the grid
Thread blocks break this down into manageable sizes
512 threads per block
SIMD instruction executes 32 elements at a time
Thus grid size = 16 blocks
Block is analogous to a strip‐mined vector loop with vector length of 32
Block is assigned to a multithreaded SIMD processor by the thread block scheduler
Current‐generation GPUs (Fermi) have 7–15 multithreaded SIMD processors
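The slide’s numbers written out as a CUDA launch (a sketch; the kernel name is illustrative):

__global__ void vmul(double *c, const double *a, const double *b) {
    /* No bounds check needed here: 8192 divides evenly into the grid. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] * b[i];
}

void launch8192(double *c, const double *a, const double *b) {
    /* 8192 elements / 512 threads per block = 16 blocks in the grid.
       Each 512-thread block is 512/32 = 16 SIMD threads (warps),
       i.e. a strip-mined loop with vector length 32. */
    vmul<<<16, 512>>>(c, a, b);
}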
Terminology
Threads of SIMD instructions
Each has its own PC
Thread scheduler uses scoreboard to dispatch
No data dependencies between threads!
Keeps track of up to 48 threads of SIMD instructions
Hides memory latency
Thread block scheduler schedules blocks to SIMD processors
Within each SIMD processor:
32 SIMD lanes
Wide and shallow compared to vector processors
Example
NVIDIA GPU has 32,768 registers
Divided into lanes
Each SIMD thread is limited to 64 registers
SIMD thread has up to:
64 vector registers of 32 32‐bit elements
32 vector registers of 32 64‐bit elements
Fermi has 16 physical SIMD lanes, each containing 2048 registers (16 × 2048 = 32,768)
NVIDIA Instruction Set Arch.
ISA is an abstraction of the hardware instruction set
“Parallel Thread Execution (PTX)”
Uses virtual registers
Translation to machine code is performed in software
Example:
shl.s32 R8, blockIdx, 9        ; Thread Block ID * Block size (512 or 2^9)
add.s32 R8, R8, threadIdx      ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]      ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]      ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4          ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2          ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0      ; Y[i] = sum (X[i]*a + Y[i])
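CUDA source along these lines (a sketch; 512‐thread blocks assumed, matching the shift by 9 above) would compile to PTX like the above:

__global__ void daxpy(int n, double a, const double *X, double *Y) {
    /* shl.s32/add.s32 above: i = blockIdx * 512 + threadIdx */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        Y[i] = a * X[i] + Y[i];   /* the ld/mul/add/st sequence above */
}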
Conditional Branching
Like vector architectures, GPU branch hardware uses internal masks
Also uses
Branch synchronization stack
Entries consist of masks for each SIMD lane
I.e. which threads commit their results (all threads execute)
Instruction markers to manage when a branch diverges into multiple execution paths
Push on divergent branch
…and when paths converge
Act as barriers
Pops stack
Per‐thread‐lane 1‐bit predicate register, specified by programmer
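A CUDA sketch of a divergent branch (illustrative, in the style of the textbook example): while the if path runs, lanes with X[i] == 0 are masked off; the mask is inverted for the else path; the two paths reconverge where the stack entry is popped:

__global__ void cond(double *X, const double *Y, const double *Z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (X[i] != 0.0)
            X[i] = X[i] - Y[i];   /* taken path: other lanes masked off */
        else
            X[i] = Z[i];          /* mask inverted for the else path    */
        /* all lanes of the SIMD thread reconverge here */
    }
}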
NVIDIA GPU Memory Structures
Each SIMD Lane has private section of off‐chip DRAM
“Private memory”
Contains stack frame, spilling registers, and private variables
Each multithreaded SIMD processor also has local memory
Shared by SIMD lanes / threads within a block
Memory shared by SIMD processors is GPU Memory
Host can read and write GPU memory
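In CUDA terms (a sketch; names illustrative), the three levels map onto language qualifiers: per‐thread locals live in registers and spill to private memory, __shared__ arrays live in the SIMD processor’s local memory, and __device__ data lives in GPU memory that the host can access:

__device__ float gpu_table[1024];     /* GPU memory: visible to all blocks,
                                         accessible from the host via
                                         cudaMemcpyToSymbol and friends   */

__global__ void levels(float *out) {  /* launch with 256 threads per block */
    __shared__ float tile[256];       /* local memory: shared by the lanes/
                                         threads of one block             */
    float t = 2.0f * threadIdx.x;     /* private: per-thread value; spills
                                         to private memory if needed      */
    tile[threadIdx.x] = t + gpu_table[threadIdx.x];
    __syncthreads();                  /* threads of the block synchronize */
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}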
Fermi Architecture Innovations
Each SIMD processor has
Two SIMD thread schedulers, two instruction dispatch units
16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load‐store units, 4 special function units
Thus, two threads of SIMD instructions are scheduled every two clock cycles
Fast double precision
Caches for GPU memory
64‐bit addressing and unified address space
Error correcting codes
Faster context switching
Faster atomic instructions
Fermi Multithreaded SIMD Proc.