
Lecture 9: SIMD Architectures

 Vector processors

 Array processors

 Cray supercomputers

 Multimedia extensions

Introduction
 Manipulation of arrays or vectors is a common operation in
scientific and engineering applications.

 Typical operations on array-oriented data include:


 Processing one or more vectors to produce a scalar result.
 Combining two vectors to produce a third one.
 Combining a scalar and a vector to generate a vector.
 A combination of the above three operations.

 Two architectures suitable for vector processing are:


 Pipelined vector processors
• Implemented in many supercomputers
 Parallel array processors

 Compiler does some of the difficult work of finding parallelism,


so the hardware doesn’t have to.
 Data parallelism.

Vector Processor Architecture
[Figure: block diagram of a vector processor. An instruction fetch and decode unit dispatches scalar instructions to the scalar unit (scalar registers plus scalar functional units) and vector instructions to the vector unit (vector registers plus vector functional units); both units connect to memory.]

Vector Unit Operation Model


[Figure: vector unit operation model — vector registers feed a pipelined ALU, and both connect to the memory system.]

Vector Processors
 Strictly speaking, vector processors are not parallel processors.
 They only behave like SIMD computers.
 There are not several CPUs in a vector processor, running in parallel.
 They are SISD processors with vector instructions executed on pipelined
functional units.
 Vector computers usually have vector registers which can each store
64 to 128 values.
 Vector instructions examples:
 Load vector from memory into vector register
 Store vector into memory
 Arithmetic and logic operations between vectors
 Operations between vectors and scalars
 Programmers can use vector operations in their programs, and the
compiler translates these operations into vector instructions at
machine level.

The Vector Unit


 A vector unit consists of pipelined functional units, which perform
ALU operations on vectors in a pipelined fashion.

 It also has vector registers, including:


 A set of general purpose vector registers, each of length s (e.g., 128);
 A vector length register VL, which stores the length l (0 ≤ l ≤ s) of the
currently processed vector(s);
 A mask register M, which stores a set of l bits, one for each element
in a vector, interpreted as Boolean values;
• Vector instructions can be executed in masked mode so that
vector elements corresponding to a false value in M are ignored.
[Figure: a vector register VR2 with VL = 8 and a mask register M holding the bits 1 1 0 1 1 1 0 1, one per element.]

Vector Program
 Consider an element-by-element addition of two N-element vectors A
and B to create the sum vector C.

 On an SISD machine, this computation will be implemented as:


for i = 0 to N-1 do
C[i] := A[i] + B[i];
 There will be N*K instruction fetches (assuming that K instructions are
needed for each iteration) and N additions.
 There will also be N conditional branches, if loop unrolling is not used.

 A compiler for a vector computer generates something like:


C[0:N-1] ← A[0:N-1] + B[0:N-1];
 Even though N additions will still be performed, there will only be K'
instruction fetches (e.g., Load A, Load B, Add_vector, Write C: 4
instructions).
 No conditional branch is needed.

Features of Vector Processors


 Advantages:
 Quick fetch and decode of a single instruction for multiple operations.
 The instruction provides a regular stream of data, which arrives
each cycle and can be processed efficiently in a pipelined fashion.
 The compiler can do the work for you.

 Memory-to-memory operation mode:


 No registers are needed.
 It can process very long vectors; but startup time is large.
 It appeared in the 70s and died in the 80s.

 Register-to-register operations are more popular now:


 Operations are performed on values stored in the vector registers.

 They are usually part of a supercomputer or a mainframe.

IBM 3090 with Vector Facility
 Very similar to a superscalar computer.

 Little impact on software.

 Vector processors execute vector instructions.

Lecture 9: SIMD Architectures

 Vector processors

 Array processors

 Cray supercomputers

 Multimedia extensions

Array Processors
 It is composed of N identical processing elements and a number
of memory modules.
 All PEs are under the control of a single control unit.
 They execute instructions in lock-step mode.

 Processing units and memory elements communicate with each
other through an interconnection network.
 Different topologies can be used.

 Complexity of the control unit is at the same level as in a
uniprocessor system.
 The control unit is usually itself a computer with its own high-speed
registers, local memory and ALU.
 The main memory is the collection of the memory modules.

Global Memory Organization

[Figure: global memory organization — a control unit issues the instruction stream (IS) to processing elements PE1 … PEn, which reach shared memory modules M1 … Mk through an interconnection network; an I/O system is attached.]
Array Processor Classification
 Processing element complexity
 Single-bit processors
• Connection Machine (CM-2): 65,536 PEs connected by
a hypercube network (by Thinking Machines Corporation).
 Multi-bit processors
• ILLIAC IV (64-bit), MasPar MP-1 (32-bit)

 Processor-memory interconnection
 Dedicated memory organization
• ILLIAC IV, CM-2, MP-1
 Global memory organization
• Bulk Synchronous Parallel (BSP) computer

Dedicated Memory Organization

[Figure: dedicated memory organization — the control unit issues the instruction stream (IS) to PE1 … PEn, each paired with its own local memory module M1 … Mn; PEs communicate through an interconnection network, with an I/O system attached.]

Features of Array Processors


 Control and scalar type instructions are executed in the
control unit.
 Vector instructions are performed in the processing
elements.
 Data organization and detection of parallelism in a
program are major issues when using such architecture.
 Operations such as C(i) = A(i) × B(i), 1 ≤ i ≤ n, could be
executed in parallel, if the elements of the arrays A and
B are distributed properly among the processors or
memory modules.
 Ex. PEi is assigned the task of computing C(i).
 In the ideal case, the number of PEs matches the array dimension n.

An Example
To compute

Y = Σ_{i=1}^{N} A(i) × B(i)

Assuming:
 A dedicated memory organization.
 Elements of A and B are properly and perfectly distributed
among processors (the compiler can help here).
We have:
 The product terms are computed in parallel.
 Additions can be done in log2N iterations in a pair-wise manner.
 Speed up factor (assuming that addition and multiplication take
the same time):

S = (2N - 1) / (1 + log2 N)

ILLIAC IV
 ILLIAC IV is a classical example of Array Processors.
 A typical SIMD computer for array processing.
 64 Processing Elements (PEs), each with its local
memory.
 One single Control Unit (CU).
 CU can access all memory.
 PEs can access local memory and communicate with
neighbors.
 CU reads program and broadcasts instructions to PEs.

ILLIAC IV Architecture

Lecture 9: SIMD Architectures

 Vector processors

 Array processors

 Cray supercomputers

 Multimedia extensions

Cray X1: Parallel Vector Machine
 Cray combines several technologies in the X1 machine (2003):
 12.8 Gflop/s high-performance vector processors.
 Shared caches.
• 4-processor nodes sharing a 2 MB cache and up to 64 GB of
memory.
 Multi-streaming vector processing.
 Multiple node architecture.

Cray X1: Building Block


 MSP: Multi-Streaming vector Processor
 Formed by 4 SSPs (each a 2-pipe vector processor).
 Balance computations across SSPs.
 Compiler will try to vectorize/parallelize across the MSP,
achieving “streaming.”

[Figure: one MSP — four custom SSP blocks (S), each driving two vector pipes (V); 12.8 Gflops (64-bit) or 25.6 Gflops (32-bit); 51 GB/s load and 25-41 GB/s store bandwidth into four 0.5 MB caches ($) that form the 2 MB shared cache, connected to local memory and the network. Figure source: J. Levesque, Cray.]

Cray X1: Node

[Figure: one X1 node — 16 processors (P), each with a cache ($), connected to 16 memory modules (mem) and two I/O channels.]

 Shared memory
 32 network links and four I/O links per node

Cray X1: 32 Nodes

[Figure: 32 nodes connected by routers (R) through a fast switch.]

Cray X1: Parallelism
 Many levels of parallelism
 Within a processor: vectorization.
 Within an MSP: streaming.
 Within a node: shared memory.
 Across nodes: message passing.

 Some are automated by the compiler, some require work by the
programmer:
 This is a common trend.
 The more complex the architecture, the more difficult it is for the
programmer to exploit it.

 Hard to fit this machine into a simple taxonomy!

Most Powerful Supercomputer - Titan


 Ranked 1st in the world on November 12, 2012.
 Developed by Cray Inc., and became operational in
October 2012.
 Performance: 17.59 petaFLOPS (10^15 FLOPS).
 Memory size: 710 terabytes (10^12 bytes).
 The latest trends in supercomputing:
 18,688 AMD Opteron 6274 16-core CPUs, running at
2.2 GHz.
 18,688 Nvidia Tesla K20 GPUs, each containing 2,496
CUDA cores running at 732 MHz.
 Cost was estimated to be $97 million.

Lecture 9: SIMD Architectures

 Vector processors

 Array processors

 Cray supercomputers

 Multimedia extensions

Multimedia Extensions
How do we extend general purpose microprocessors so that they
can handle multimedia applications efficiently?

Analysis of the need:


 Video and audio applications very often deal with large arrays of
small data types (8 or 16 bits).
 Such applications exhibit a large potential of SIMD (vector)
parallelism.
 Data parallelism.

Solutions:
 New generations of general purpose microprocessors are
equipped with special instructions to exploit this parallelism.
 The specialized multimedia instructions perform vector
computations on bytes, half-words, or words.

Special Instructions
 Several vendors have extended the instruction set of their
processors in order to improve performance with multimedia
applications:
 MMX for Intel x86 family;
 VIS for UltraSparc;
 MDMX for MIPS; and
 MAX-2 for Hewlett-Packard PA-RISC.

 The Pentium line provides 57 MMX instructions, which treat data
in a SIMD fashion to improve the performance of:
 Computer-aided design;
 Internet application;
 Computer visualization;
 Video games; and
 Speech recognition.

Implementation
The basic idea: sub-word execution
 Use the entire width of a processor data path (e.g., 64
bits) when processing small data (8, 12, or 16 bits).
 With word size 64 bits, an adder can be used to
implement eight 8-bit additions in parallel.
 MMX technology allows a single instruction to work on
multiple pieces of data.
 Consequently we have practically a kind of SIMD
parallelism, at a reduced scale and with very low cost.

Packed Data Types
 Three packed data types are defined for parallel operations:
packed byte, packed word, packed double word.

Packed byte (eight 8-bit elements)
q7 q6 q5 q4 q3 q2 q1 q0

Packed word (four 16-bit elements)
q3 q2 q1 q0

Packed double word (two 32-bit elements)
q1 q0

Quad word (one 64-bit element)
q0

All fit within a 64-bit register.

SIMD Arithmetic Examples


ADD R3 ← R1, R2
R1 a7 a6 a5 a4 a3 a2 a1 a0
+ + + + + + + +
R2 b7 b6 b5 b4 b3 b2 b1 b0
= = = = = = = =
R3 a7+b7 a6+b6 a5+b5 a4+b4 a3+b3 a2+b2 a1+b1 a0+b0

MULADD R3 ← R1, R2
R1 a7 a6 a5 a4 a3 a2 a1 a0
×&+ ×&+ ×&+ ×&+ ×&+ ×&+ ×&+ ×&+
R2 b7 b6 b5 b4 b3 b2 b1 b0
= = = = = = = =
R3 (a6×b6)+(a7×b7) (a4×b4)+(a5×b5) (a2×b2)+(a3×b3) (a0×b0)+(a1×b1)

Performance Comparison
 The following shows the performance of Pentium processors
(32-bit machine) with and without MMX technology:

Application        Without MMX   With MMX   Speedup
Video                 155.52       268.70     1.72
Image processing      159.03       743.90     4.67
3D geometry           161.52       166.44     1.03
Audio                 149.80       318.90     2.13
OVERALL               156.00       255.43     1.64

Summary
 Vector processors are SISD processors whose instruction sets
include operations on vectors.
 They are implemented using pipelined functional units.
 They behave like SIMD machines.
 Array processors, being typical SIMD, execute the same
operation on a set of interconnected processing units.
 Both vector and array processors are specialized for numerical
problems expressed in matrix or vector formats.
 They are usually integrated inside a large computer.
 Many modern architectures deploy several parallel architecture
concepts at the same time, as the Cray X1 does.
 Multimedia applications exhibit a large potential of SIMD
parallelism, which can be implemented by extending the
traditional SISD architecture.
