array & vector processor
Vector processors
Array processors
Cray supercomputers
Multimedia extensions
Introduction
Manipulation of arrays or vectors is a common operation in
scientific and engineering applications.
Vector Processor Architecture
[Figure: an instruction fetch and decode unit sends scalar instructions to the scalar unit (scalar registers, scalar functional units) and vector instructions to the vector unit (vector registers, vector functional units). Both units are connected to the memory system.]
Vector Processors
Strictly speaking, vector processors are not parallel processors.
They only behave like SIMD computers.
A vector processor does not contain several CPUs running in parallel.
They are SISD processors with vector instructions executed on pipelined
functional units.
Vector computers usually have vector registers which can each store
64 to 128 values.
Examples of vector instructions:
Load vector from memory into vector register
Store vector into memory
Arithmetic and logic operations between vectors
Operations between vectors and scalars
Programmers can use operations on vectors in their programs, and
the compiler translates these operations into vector instructions
at machine level.
Vector Program
Consider an element-by-element addition of two N-element vectors A
and B to create the sum vector C.
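The addition above can be sketched as a scalar loop next to a strip-mined vector version. This is only an illustrative model, not the code from the original slide: the names `scalar_add`, `vector_add`, and `VLEN` are hypothetical, and a register length of 64 is assumed from the "64 to 128 values" mentioned earlier. Each pass of the vector loop stands for one vector load of A, one vector load of B, one vector add, and one vector store.

```python
VLEN = 64  # assumed vector register length (the text gives 64 to 128 values)

def scalar_add(a, b):
    """Scalar version: one addition per loop iteration."""
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

def vector_add(a, b, vlen=VLEN):
    """Strip-mined version: each pass models four vector instructions."""
    n = len(a)
    c = [0] * n
    vector_instructions = 0
    for start in range(0, n, vlen):
        end = min(start + vlen, n)
        va = a[start:end]                     # load vector A into a register
        vb = b[start:end]                     # load vector B into a register
        vc = [x + y for x, y in zip(va, vb)]  # one vector add
        c[start:end] = vc                     # store vector C to memory
        vector_instructions += 4
    return c, vector_instructions

a = list(range(200))
b = [2 * x for x in range(200)]
c, count = vector_add(a, b)
```

With N = 200 and a register length of 64, the vector loop runs four times (three full strips and one of 8 elements), whereas the scalar loop runs 200 times.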
IBM 3090 with Vector Facility
Very similar to a superscalar computer.
Little impact on software.
Vector processors execute vector instructions.
Array Processors
An array processor is composed of N identical processing elements
(PEs) and a number of memory modules.
All PEs are under the control of a single control unit.
They execute instructions in lock-step mode.
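[Figure: a control unit sends the instruction stream (IS) to processing elements PE1 ... PEn, which reach the shared memory modules M1 ... Mk through an interconnection network. An I/O system is also shown.]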
Array Processor Classification
Processing element complexity
Single-bit processors
• Connection Machine (CM-2): 65536 PEs connected by
a hypercube network (by Thinking Machines Corporation).
Multi-bit processors
• ILLIAC IV (64-bit), MasPar MP-1 (32-bit)
Processor-memory interconnection
Dedicated memory organization
• ILLIAC IV, CM-2, MP-1
Global memory organization
• Burroughs Scientific Processor (BSP)
Dedicated Memory Organization
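[Figure: each processing element is paired with its own memory module (PE1/M1, PE2/M2, ..., PEn/Mn). The PEs are linked by an interconnection network, and the control unit sends the instruction stream (IS) to all PEs. An I/O system is also shown.]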
An Example
To compute
$$Y = \sum_{i=1}^{N} A(i) \cdot B(i)$$
Assuming:
A dedicated memory organization.
Elements of A and B are properly and perfectly distributed
among processors (the compiler can help here).
We have:
The product terms are computed in parallel.
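Additions can be done in $\log_2 N$ iterations in a pair-wise manner.
Speed-up factor (assuming that addition and multiplication take
the same time): a sequential machine needs N multiplications and
N-1 additions, while the array processor needs one multiplication
step and $\log_2 N$ addition steps, so
$$S = \frac{2N - 1}{1 + \log_2 N}$$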
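The step count behind the speed-up factor can be checked with a small simulation. This is a sketch under stated assumptions: N is a power of two, there is one PE per element, and each list comprehension stands for one lock-step parallel operation. The function names `parallel_dot` and `speedup` are hypothetical.

```python
import math

def parallel_dot(a, b):
    """Return (Y, parallel_steps) for Y = sum of A(i) * B(i)."""
    n = len(a)
    assert n == len(b) and n > 0 and n & (n - 1) == 0, "N assumed a power of two"
    # Step 1: every PE multiplies its own pair of elements at the same time.
    partial = [x * y for x, y in zip(a, b)]
    steps = 1
    # Pair-wise additions: each pass halves the number of partial sums.
    while len(partial) > 1:
        half = len(partial) // 2
        partial = [partial[i] + partial[i + half] for i in range(half)]
        steps += 1
    return partial[0], steps

def speedup(n):
    """S = (2N - 1) / (1 + log2 N), as in the formula above."""
    return (2 * n - 1) / (1 + math.log2(n))

y, steps = parallel_dot([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
```

For N = 8 the simulation takes 4 parallel steps against 15 sequential operations, which matches S = 15/4 = 3.75.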
ILLIAC IV
ILLIAC IV is a classic example of an array processor:
a typical SIMD computer for array processing.
64 Processing Elements (PEs), each with its local
memory.
One single Control Unit (CU).
CU can access all memory.
PEs can access local memory and communicate with
neighbors.
CU reads program and broadcasts instructions to PEs.
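The division of work between CU and PEs can be modelled in a few lines. This is a toy model, not the ILLIAC IV instruction set: the class `PE`, the functions `broadcast` and `shift_from_left`, and the ring-shaped neighbour connection are all assumptions made for illustration, and only four PEs are used instead of 64.

```python
class PE:
    """One processing element with its own local memory (a dict here)."""
    def __init__(self, a, b):
        self.mem = {"a": a, "b": b, "r": 0}

def broadcast(pes, op):
    """Control unit: every PE executes the same operation on its local data."""
    for pe in pes:
        op(pe)

def shift_from_left(pes, key):
    """Each PE receives `key` from its left neighbour (ring is an assumption)."""
    values = [pe.mem[key] for pe in pes]  # read all values before any write
    for i, pe in enumerate(pes):
        pe.mem[key] = values[i - 1]       # PE 0 wraps around to the last PE

def mul(pe):
    """The broadcast instruction: r = a * b, using local memory only."""
    pe.mem["r"] = pe.mem["a"] * pe.mem["b"]

pes = [PE(a, b) for a, b in zip([1, 2, 3, 4], [10, 20, 30, 40])]
broadcast(pes, mul)
products = [pe.mem["r"] for pe in pes]
shift_from_left(pes, "r")
shifted = [pe.mem["r"] for pe in pes]
```

The sequential `for` loop in `broadcast` only stands in for what the hardware does simultaneously on all PEs.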
ILLIAC IV Architecture
Cray X1: Parallel Vector Machine
Cray combines several technologies in the X1 machine (2003):
12.8 Gflop/s high-performance vector processors.
Shared caches.
• 4 processor nodes sharing 2 MB cache, and up to 64 GB of
memory.
Multi-streaming vector processing.
Multiple node architecture.
[Figure: one processor built from custom blocks: four scalar units (S), eight vector units (V), and four 0.5 MB shared caches ($) forming a 2 MB cache. Peak rates: 12.8 Gflops (64 bit), 25.6 Gflops (32 bit); 51 GB/s load, 25-41 GB/s store.]
Cray X1: Node
[Figure: 16 processors (P), each with a cache ($), above 16 memory blocks (M, mem) and two I/O blocks (IO); eight routers (R) connect to a fast switch.]
Shared memory.
32 network links and four I/O links per node.
Cray X1: Parallelism
Many levels of parallelism
Within a processor: vectorization.
Within an MSP: streaming.
Within a node: shared memory.
Across nodes: message passing.
Lecture 9: SIMD Architectures
Multimedia Extensions
How do we extend general purpose microprocessors so that they
can handle multimedia applications efficiently?
Solutions:
New generations of general purpose microprocessors are
equipped with special instructions to exploit the SIMD
parallelism found in multimedia applications.
The specialized multimedia instructions perform vector
computations on bytes, half-words, or words.
Special Instructions
Several vendors have extended the instruction set of their
processors in order to improve performance with multimedia
applications:
MMX for Intel x86 family;
VIS for UltraSparc;
MDMX for MIPS; and
MAX-2 for Hewlett-Packard PA-RISC.
Implementation
The basic idea: sub-word execution
Use the entire width of a processor data path (e.g., 64
bits), when processing small data (8, 12, or 16 bits).
With word size 64 bits, an adder can be used to
implement eight 8-bit additions in parallel.
MMX technology allows a single instruction to work on
multiple pieces of data.
In practice this gives a form of SIMD parallelism,
at a reduced scale and with very low cost.
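The sub-word idea can be imitated in software. The sketch below is an illustration, not MMX itself: hardware breaks the carry chain between sub-words, whereas this code gets the same effect with masks on an ordinary 64-bit addition. The helper names `pack_bytes`, `unpack_bytes`, and `packed_add8` are hypothetical, and wrap-around (modulo 256) arithmetic is assumed.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
HIGH = 0x8080808080808080  # top bit of every 8-bit lane
LOW7 = 0x7F7F7F7F7F7F7F7F  # low seven bits of every 8-bit lane

def pack_bytes(values):
    """Pack eight 0..255 values into one 64-bit word, values[0] in the low byte."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word

def unpack_bytes(word):
    """Split a 64-bit word back into its eight bytes, low byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]

def packed_add8(x, y):
    """Eight wrap-around 8-bit additions using a single wide addition."""
    # Add the low 7 bits of each lane; any carry lands in bit 7 of the same lane.
    low = (x & LOW7) + (y & LOW7)
    # Bit 7 of each lane is x7 XOR y7 XOR carry; the carry out of the lane is dropped.
    return (low ^ ((x ^ y) & HIGH)) & MASK64

a = [1, 200, 255, 16, 0, 127, 128, 99]
b = [2, 100, 1, 240, 0, 1, 128, 1]
result = unpack_bytes(packed_add8(pack_bytes(a), pack_bytes(b)))
```

Each lane of `result` equals `(a[i] + b[i]) % 256`, so an overflow in one byte never spills into its neighbour.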
Packed Data Types
Three packed data types are defined for parallel operations:
packed byte, packed word, packed double word.
Packed byte: q7 q6 q5 q4 q3 q2 q1 q0 (eight 8-bit elements)
Packed word: q3 q2 q1 q0 (four 16-bit elements)
Quad word: q0 (one 64-bit element)
Each type occupies 64 bits.
MULADD R3 ← R1, R2
R1: a7 a6 a5 a4 a3 a2 a1 a0
R2: b7 b6 b5 b4 b3 b2 b1 b0
Corresponding elements are multiplied and adjacent products are added:
R3: (a6×b6)+(a7×b7)  (a4×b4)+(a5×b5)  (a2×b2)+(a3×b3)  (a0×b0)+(a1×b1)
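The MULADD pattern above can be written out as a short model. It is a sketch only: registers are plain Python lists indexed from element 0 upwards (so `r1[0]` is a0), lane widths and overflow are not modelled, and the function name `muladd` is hypothetical rather than a real instruction mnemonic.

```python
def muladd(r1, r2):
    """Multiply lane by lane, then add each adjacent pair of products."""
    assert len(r1) == len(r2) and len(r1) % 2 == 0
    products = [a * b for a, b in zip(r1, r2)]
    # Result lane k holds (a[2k] * b[2k]) + (a[2k+1] * b[2k+1]).
    return [products[i] + products[i + 1] for i in range(0, len(products), 2)]

r1 = [1, 2, 3, 4, 5, 6, 7, 8]  # a0 .. a7
r2 = [1, 1, 2, 2, 3, 3, 4, 4]  # b0 .. b7
r3 = muladd(r1, r2)
```

Summing the lanes of `r3` gives the full dot product of the two registers, which is why this instruction pattern suits filters and transforms in multimedia code.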
Performance Comparison
The following shows the performance of Pentium processors
(32-bit machine) with and without MMX technology:
Summary
Vector processors are SISD processors whose instruction set
includes operations on vectors.
They are implemented using pipelined functional units.
They behave like SIMD machines.
Array processors, being typical SIMD, execute the same
operation on a set of interconnected processing units.
Both vector and array processors are specialized for numerical
problems expressed in matrix or vector formats.
They are usually integrated inside a large computer.
Many modern architectures, such as the Cray X1, usually deploy
several parallel architecture concepts at the same time.
Multimedia applications exhibit a large potential of SIMD
parallelism, which can be implemented by extending the
traditional SISD architecture.