
Lecture 9: SIMD Architectures

 Vector processors

 Array processors

 Cray supercomputers

 Multimedia extensions

Introduction
 Manipulation of arrays or vectors is a common operation in
scientific and engineering applications.

 Typical operations on array-oriented data include:


 Processing one or more vectors to produce a scalar result.
 Combining two vectors to produce a third one.
 Combining a scalar and a vector to generate a vector.
 A combination of the above three operations.

 Two architectures suitable for vector processing are:


 Pipelined vector processors
• Implemented in many supercomputers
 Parallel array processors

 Compiler does some of the difficult work of finding parallelism,


so the hardware doesn’t have to.
 Data parallelism.

Vector Processor Architecture
[Figure: block diagram of a vector processor. An instruction fetch and decode unit dispatches scalar instructions to the scalar unit (scalar registers plus scalar functional units) and vector instructions to the vector unit (vector registers plus vector functional units); both units connect to memory.]

Vector Unit Operation Model


[Figure: vector unit operation model — vector registers feed a pipelined ALU, and both connect to the memory system.]

Vector Processors
 Strictly speaking, vector processors are not parallel processors.
 They only behave like SIMD computers.
 There are not several CPUs in a vector processor, running in parallel.
 They are SISD processors with vector instructions executed on pipelined
functional units.
 Vector computers usually have vector registers which can each store
64 to 128 values.
 Vector instructions examples:
 Load vector from memory into vector register
 Store vector into memory
 Arithmetic and logic operations between vectors
 Operations between vectors and scalars
 Programmers can use vector operations in their programs, and the
compiler translates these operations into vector instructions at
machine level.

The Vector Unit


 A vector unit consists of pipelined functional units, which perform
ALU operations on vectors in a pipelined fashion.

 It also has vector registers, including:


 A set of general purpose vector registers, each of length s (e.g., 128);
 A vector length register VL, which stores the length l (0 ≤ l ≤ s) of the
currently processed vector(s);
 A mask register M, which stores a set of l bits, one for each element
in a vector, interpreted as Boolean values;
• Vector instructions can be executed in masked mode so that
vector elements corresponding to a false value in M are ignored.
[Figure: a vector register VR2 with VL = 8 and a mask register M holding the bits 1 1 0 1 1 1 0 1, one per element.]

Vector Program
 Consider an element-by-element addition of two N-element vectors A
and B to create the sum vector C.

 On an SISD machine, this computation will be implemented as:


for i = 0 to N-1 do
C[i] := A[i] + B[i];
 There will be N*K instruction fetches (assuming that K instructions are
needed for each iteration) and N additions.
 There will also be N conditional branches, if loop unrolling is not used.

 A compiler for a vector computer generates something like:


C[0:N-1] ← A[0:N-1] + B[0:N-1];
 Even though N additions will still be performed, there will only be K'
instruction fetches (e.g., Load A, Load B, Add_vector, Write C: 4
instructions).
 No conditional branch is needed.

Features of Vector Processors


 Advantages:
 Quick fetch and decode of a single instruction for multiple operations.
 The instruction provides a regular stream of data, which arrives
each cycle and can be processed efficiently in a pipelined fashion.
 The compiler can do the work for you.

 Memory-to-memory operation mode:


 No registers are needed.
 It can process very long vectors; but startup time is large.
 It appeared in the 70s and died in the 80s.

 Register-to-register operations are more popular now:


 Operations are performed on values stored in the vector registers.

 They are usually part of a supercomputer or a mainframe.

IBM 3090 with Vector Facility
 Very similar to a superscalar computer.

 Little impact on software.

 Vector processors execute vector instructions.

Lecture 9: SIMD Architectures

 Vector processors

 Array processors

 Cray supercomputers

 Multimedia extensions

Array Processors
 It is composed of N identical processing elements and a number
of memory modules.
 All PEs are under the control of a single control unit.
 They execute instructions in lock-step mode.

 Processing units and memory elements communicate with each
other through an interconnection network.
 Different topologies can be used.

 Complexity of the control unit is at the same level as in a
uniprocessor system.
 The control unit is usually itself a computer with its own high-speed
registers, local memory and ALU.
 The main memory is the collection of the memory modules.

Global Memory Organization

[Figure: global memory organization — a control unit issues the instruction stream (IS) to processing elements PE1 … PEn, which reach shared memory modules M1 … Mk through an interconnection network; an I/O system is attached.]
Array Processor Classification
 Processing element complexity
 Single-bit processors
• Connection Machine (CM-2): 65,536 PEs connected by
a hypercube network (by Thinking Machines Corporation).
 Multi-bit processors
• ILLIAC IV (64-bit), MasPar MP-1 (32-bit)

 Processor-memory interconnection
 Dedicated memory organization
• ILLIAC IV, CM-2, MP-1
 Global memory organization
• Bulk Synchronous Parallel (BSP) computer

Dedicated Memory Organization

[Figure: dedicated memory organization — the control unit issues the instruction stream (IS) to PE1 … PEn, each paired with its own local memory module M1 … Mn; PEs communicate through an interconnection network, with an I/O system attached.]

Features of Array Processors


 Control and scalar type instructions are executed in the
control unit.
 Vector instructions are performed in the processing
elements.
 Data organization and detection of parallelism in a
program are major issues when using such architecture.
 Operations such as C(i) = A(i) × B(i), 1 ≤ i ≤ n, could be
executed in parallel, if the elements of the arrays A and
B are distributed properly among the processors or
memory modules.
 Ex. PEi is assigned the task of computing C(i).
 In the ideal case, the number of PEs matches the array dimension n.

An Example
To compute

Y = Σ_{i=1}^{N} A(i) × B(i)

Assuming:
 A dedicated memory organization.
 Elements of A and B are properly and perfectly distributed
among processors (the compiler can help here).
We have:
 The product terms are computed in parallel.
 Additions can be done in log2N iterations in a pair-wise manner.
 Speed up factor (assuming that addition and multiplication take
the same time):

S = (2N - 1) / (1 + log2 N)

ILLIAC IV
 ILLIAC IV is a classical example of Array Processors.
 A typical SIMD computer for array processing.
 64 Processing Elements (PEs), each with its local
memory.
 One single Control Unit (CU).
 CU can access all memory.
 PEs can access local memory and communicate with
neighbors.
 CU reads program and broadcasts instructions to PEs.

ILLIAC IV Architecture

Lecture 9: SIMD Architectures

 Vector processors

 Array processors

 Cray supercomputers

 Multimedia extensions

Cray X1: Parallel Vector Machine
 Cray combines several technologies in the X1 machine (2003):
 12.8 Gflop/s high-performance vector processors.
 Shared caches.
• 4-processor nodes sharing a 2 MB cache and up to 64 GB of
memory.
 Multi-streaming vector processing.
 Multiple node architecture.

Cray X1: Building Block


 MSP: Multi-Streaming vector Processor
 Formed by 4 SSPs (each a 2-pipe vector processor).
 Balance computations across SSPs.
 Compiler will try to vectorize/parallelize across the MSP,
achieving “streaming.”

[Figure: one MSP — four custom SSP blocks (S), each driving two vector pipes (V); 12.8 Gflops (64-bit) or 25.6 Gflops (32-bit); 51 GB/s load and 25-41 GB/s store bandwidth into four 0.5 MB caches ($) that form the 2 MB shared cache, connected to local memory and the network. Figure source: J. Levesque, Cray.]

Cray X1: Node

[Figure: one X1 node — 16 processors (P), each with a cache ($), connected to 16 memory modules (mem) and two I/O channels.]

 Shared memory
 32 network links and four I/O links per node

Cray X1: 32 Nodes

[Figure: 32 nodes connected by routers (R) through a fast switch.]

Cray X1: Parallelism
 Many levels of parallelism
 Within a processor: vectorization.
 Within an MSP: streaming.
 Within a node: shared memory.
 Across nodes: message passing.

 Some are automated by the compiler, some require work by the
programmer:
 This is a common trend.
 The more complex the architecture, the more difficult it is for the
programmer to exploit it.

 Hard to fit this machine into a simple taxonomy!

Most Powerful Supercomputer - Titan


 Ranked 1st in the world on November 12, 2012.
 Developed by Cray Inc., and became operational in
October 2012.
 Performance: 17.59 petaFLOPS (10^15 FLOPS).
 Memory size: 710 terabytes (10^12 bytes).
 The latest trends in supercomputing:
 18,688 AMD Opteron 6274 16-core CPUs, running at
2.2 GHz.
 18,688 Nvidia Tesla K20 GPUs, each containing 2,496
CUDA cores running at 732 MHz.
 Cost was estimated to be $97 million.

Lecture 9: SIMD Architectures

 Vector processors

 Array processors

 Cray supercomputers

 Multimedia extensions

Multimedia Extensions
How do we extend general purpose microprocessors so that they
can handle multimedia applications efficiently?

Analysis of the need:


 Video and audio applications very often deal with large arrays of
small data types (8 or 16 bits).
 Such applications exhibit a large potential of SIMD (vector)
parallelism.
 Data parallelism.

Solutions:
 New generations of general purpose microprocessors are
equipped with special instructions to exploit this parallelism.
 The specialized multimedia instructions perform vector
computations on bytes, half-words, or words.

Special Instructions
 Several vendors have extended the instruction set of their
processors in order to improve performance with multimedia
applications:
 MMX for Intel x86 family;
 VIS for UltraSparc;
 MDMX for MIPS; and
 MAX-2 for Hewlett-Packard PA-RISC.

 The Pentium line provides 57 MMX instructions, which treat data
in a SIMD fashion to improve the performance of:
 Computer-aided design;
 Internet application;
 Computer visualization;
 Video games; and
 Speech recognition.

Implementation
The basic idea: sub-word execution
 Use the entire width of a processor data path (e.g., 64
bits) when processing small data (8, 12, or 16 bits).
 With word size 64 bits, an adder can be used to
implement eight 8-bit additions in parallel.
 MMX technology allows a single instruction to work on
multiple pieces of data.
 Consequently we have practically a kind of SIMD
parallelism, at a reduced scale and with very low cost.

Packed Data Types
 Three packed data types are defined for parallel operations:
packed byte, packed word, packed double word.

Packed byte (eight 8-bit elements)
q7 q6 q5 q4 q3 q2 q1 q0

Packed word (four 16-bit elements)
q3 q2 q1 q0

Packed double word (two 32-bit elements)
q1 q0

Quad word (one 64-bit element)
q0

All fit within a 64-bit register.

SIMD Arithmetic Examples


ADD R3 ← R1, R2
R1 a7 a6 a5 a4 a3 a2 a1 a0
+ + + + + + + +
R2 b7 b6 b5 b4 b3 b2 b1 b0
= = = = = = = =
R3 a7+b7 a6+b6 a5+b5 a4+b4 a3+b3 a2+b2 a1+b1 a0+b0

MULADD R3 ← R1, R2
R1 a7 a6 a5 a4 a3 a2 a1 a0
×&+ ×&+ ×&+ ×&+ ×&+ ×&+ ×&+ ×&+
R2 b7 b6 b5 b4 b3 b2 b1 b0
= = = = = = = =
R3 (a6×b6)+(a7×b7) (a4×b4)+(a5×b5) (a2×b2)+(a3×b3) (a0×b0)+(a1×b1)

Performance Comparison
 The following shows the performance of Pentium processors
(32-bit machine) with and without MMX technology:

Application        Without MMX   With MMX   Speedup
Video                 155.52       268.70     1.72
Image processing      159.03       743.90     4.67
3D geometry           161.52       166.44     1.03
Audio                 149.80       318.90     2.13
OVERALL               156.00       255.43     1.64

Summary
 Vector processors are SISD processors whose instruction sets
include operations on vectors.
 They are implemented using pipelined functional units.
 They behave like SIMD machines.
 Array processors, being typical SIMD, execute the same
operation on a set of interconnected processing units.
 Both vector and array processors are specialized for numerical
problems expressed in matrix or vector formats.
 They are usually integrated inside a large computer.
 Many modern architectures deploy several parallel architecture
concepts at the same time, as the Cray X1 does.
 Multimedia applications exhibit a large potential of SIMD
parallelism, which can be implemented by extending the
traditional SISD architecture.
