
Digital Signal Processing: Comp Eng 4Tl4


COMP ENG 4TL4:

Digital Signal Processing


Notes for Lectures #31 & #32
Tuesday, November 25 &
Wednesday, November 26, 2003

8. Introduction to DSP Architectures
4TL4 DSP
Jeff Bondy and Ian Bruce

DSP Applications
- High volume embedded systems
  - Cell phones
  - Hard Drives
  - CD Drives
  - Modems
  - Printers
- High performance data processing
  - Sonar
  - Wireless Basestations
  - Video/Data Transport

Resources
- www.bdti.com (Started kernel speed benchmarking)
- www.eembc.org (Benchmarks for almost any application)
- http://www.techonline.com/community/tech_group/dsp
- (Motorola) http://ewww.motorola.com/webapp/sps/site/homepage.jsp?nodeId=06M10NcX0Fz
- (TI) http://dspvillage.ti.com/
- (Analog Devices) http://www.analog.com/Analog_Root/static/technology/dsp/beginnersGuide/index.html/

In ONE Cycle
- Fetch instruction [FETCH]
- Decode instruction [DECODE]
- Calculate address [READ]
- Fetch data [READ]
  - To L2 hopefully, or else increase latency by going off chip, update L2 state
  - L2 to L1, update L2 and L1 state
  - L1 to Registers
  - Registers to ALU
- Compute instruction [EXECUTE]
- Write result
- Update data pointers
- Update instruction pointer

Intro to DSP Architecture


- What and Why of MACs
- Multiple Memory Accesses
- Fast Address Generation Units
- Fast Looping
- Specialized Instruction Sets
- Lots of I/O

Typical DSP Heart


[Figure: block diagram of a typical DSP core. Labels: Data Buses; Abundant Instant Memory Access; Huge ALU Dynamic Range; FAST ALU; Chained Shifter for repetitive calculations; Barrel Shifter]

MACs: Multiply Accumulates

- In one clock cycle the ALU of a DSP can do a multiply and addition.
- Used in:
  - Vector dot products
  - Correlation
  - Filters
  - Fourier Transforms
- In addition to ALU changes the bus structure must also change
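
To make the multiply-accumulate pattern concrete, here is a minimal Python sketch of a dot product and a single correlation lag, both built from one MAC step per element. The function names (`mac`, `dot`, `correlate_at_lag`) are illustrative and not taken from the lecture; on a DSP each call to `mac` would correspond to one single-cycle instruction.

```python
def mac(acc, x, y):
    """One multiply-accumulate step: returns acc + x * y."""
    return acc + x * y


def dot(xs, ys):
    """Vector dot product as a chain of MAC steps (stops at the shorter input)."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


def correlate_at_lag(x, y, lag):
    """Sum of x[n] * y[n + lag] over the overlapping samples, for lag >= 0."""
    return dot(x, y[lag:])
```

An N-tap filter output or an N-point correlation lag therefore costs about N MAC steps, which is why a single-cycle MAC matters so much.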

Multiple Memory Accesses


- Complete MANY memory accesses in a single clock cycle
- Processor can fetch instructions while also fetching the operands or storing to memory
- During an FIR filter the processor can perform a multiply and accumulate while loading the operand and coefficient for the next cycle
  - Three reads and one or two writes per cycle
- This requires multiple memory buses on the same chip, not simply one address bus and one data bus

Dedicated Address Generation


- One or more address generation units, so the processor doesn't tie up the ALU/main data path
  - Register indirect addressing with post-increment
  - Modulo addressing
  - Bit reversed addressing
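
As an illustration of bit-reversed addressing, the sketch below computes in software the index sequence that an address generation unit would produce in hardware. This is the ordering a radix-2 FFT needs for its input or output. The helper names are hypothetical, and the length is assumed to be a power of two.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Bit-reversed index sequence for a length-n buffer (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]
```

For an 8-point buffer this gives the order 0, 4, 2, 6, 1, 5, 3, 7. A DSP with bit-reversed addressing walks memory in this order without spending ALU cycles on the index arithmetic.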

Efficient looping
- For repetitive, or branching calculations. For-next loops in a general purpose algorithm kill performance with calculating conditions, checking loop logic and setting JUMPs.
  - <loop> and <repeat> instructions allow jumping to top of loop while incrementing and testing loop logic in a SINGLE cycle.
- Delayed branching
- Low~Mid range DSPs have 3~5 stage pipelines to get rid of NOPs

Pipelining
[Figure: instruction timing diagrams in which each row shows the Fetch, Decode, Read and Execute stages of one instruction, for three cases: None (Motorola 560xx, i.e. OLD); Pipelined (most conventional DSP processors); Superscalar (Pentium, MIPS)]
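
A rough cycle count shows why pipelining pays off. The sketch below assumes an idealized four-stage machine (Fetch, Decode, Read, Execute) in which every stage takes one cycle and there are no stalls or branch penalties. The function names are illustrative only.

```python
def cycles_unpipelined(n_instructions, n_stages=4):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=4):
    """A new instruction enters the pipeline every cycle once it is full."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1
```

Under these assumptions three instructions take 12 cycles unpipelined and 6 cycles pipelined, and for long instruction streams the speed-up approaches the number of stages.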

Instruction Sets
- Maximize use of underlying hardware
  - Increase instruction efficiency, complex instructions, many different operations/accesses per call.
- Minimize amount of memory used
  - Instructions must be short, restrict flexibility such as register choice, multiple operation connections.
  - DSPs have fewer/smaller registers, use mode bits to morph some operations, highly individualized and irregular instruction sets.
- You can compile C code into a DSP target but for efficient code it MUST BE HAND OPTIMIZED.

Lots of I/O
- Large array and amount of I/O versus microprocessor
- Specialized instruction set and hardware to deal with fast off-chip memory access such as DMA

GPP exceptions
- General Purpose Processors have fought back because of the huge market that DSPs were beginning to encroach on
  - MMX (Pentium)
  - SSE (Pentium)
  - SH-2 (Strong Arm)
  - Power PC (AltiVec)
  - UltraSPARC (VIS Visual Instruction Set)
- Strange? Isn't this what CRAY was saying, that vectorizing processors were the most powerful architecture?

Pentium 266 MMX Versus TMS320C62x

- 4x More power
- 1/3 MIPS
- 1/3 256-FFT completion time
- Same price
- 4x Die Size
- Pentium needs extensive cooling

Modulo Addressing
Modulo addressing: implementing circular buffers and delay lines

Data-shifting

Time  | Buffer contents                                | Next sample
n=N   | x[N-K+1]  x[N-K+2]  ...  x[N-1]  x[N]          | x[N+1]
n=N+1 | x[N-K+2]  x[N-K+3]  ...  x[N]    x[N+1]        | x[N+2]
n=N+2 | x[N-K+3]  x[N-K+4]  ...  x[N+1]  x[N+2]        | x[N+3]

Modulo addressing

Time  | Buffer contents                                | Next sample
n=N   | x[N-K+1]  x[N-K+2]  ...  x[N-2]  x[N-1]  x[N]  | x[N+1]
n=N+1 | x[N+1]    x[N-K+2]  ...  x[N-2]  x[N-1]  x[N]  | x[N+2]
n=N+2 | x[N+1]    x[N+2]    ...  x[N-2]  x[N-1]  x[N]  | x[N+3]
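
The difference between the two schemes can be sketched in a few lines of Python. With modulo addressing each new sample overwrites the oldest one and only a pointer moves, so no data is shifted. The class name and its methods are hypothetical; on a DSP the `% len(...)` wrap-around is done by the address generation unit at no cost.

```python
class CircularBuffer:
    """Fixed-size delay line that overwrites the oldest sample in place."""

    def __init__(self, size):
        self.data = [0] * size
        self.head = 0  # position of the oldest sample, overwritten next

    def push(self, sample):
        """Store a new sample and advance the pointer modulo the buffer size."""
        self.data[self.head] = sample
        self.head = (self.head + 1) % len(self.data)

    def newest_first(self):
        """Return the samples ordered from most recent to oldest."""
        n = len(self.data)
        return [self.data[(self.head - 1 - k) % n] for k in range(n)]
```

Pushing 1, 2, 3, 4 into a three-element buffer leaves the storage as `[4, 2, 3]`: only one memory write per sample, compared with K writes per sample for the data-shifting scheme.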

DSP Characteristics
- Arithmetic Format
- Bus Width
- Speed
- Memory/Bus/Instruction architecture
- Development Tools
- Power Consumption
- Cost
- Specialized Hardware

Arithmetic
- Fixed Point or Floating Point?
  - Fixed: numbers are integers in a set range
  - Float: numbers are represented by a mantissa and exponent
  - Fixed: cheaper, higher volume, faster, less power, horrible amounts of time tweaking and rescaling at different points in a calculation. 95% of DSP Market.
  - Float: Wider dynamic range, larger die size, easier, becoming more available. 5% of DSP Market.
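
The "tweaking and rescaling" burden of fixed point shows up even in a single multiply. The sketch below assumes the common Q15 format (16-bit signed integers representing values in [-1, 1)). The helper names are illustrative and not tied to any vendor library.

```python
Q15_ONE = 1 << 15          # scale factor: 1.0 is represented as 32768
Q15_MAX = (1 << 15) - 1    # largest representable value, just under 1.0
Q15_MIN = -(1 << 15)       # smallest representable value, exactly -1.0


def saturate(value):
    """Clamp to the 16-bit signed range instead of wrapping around."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(x):
    """Convert a float in [-1, 1) to Q15, saturating out-of-range input."""
    return saturate(int(round(x * Q15_ONE)))


def q15_mul(a, b):
    """Multiply two Q15 numbers: the Q30 product is rounded and rescaled to Q15."""
    product = a * b                      # 16 x 16 -> up to 32 bits, Q30 format
    return saturate((product + (1 << 14)) >> 15)
```

Every product has to be shifted back by 15 bits, and the one overflow case (-1.0 times -1.0) has to be saturated. A floating point unit handles both automatically.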

Bus Widths
- Fixed: usually 16 bit data bus
- Float: 32 bit, standard IEEE mantissa-exponent format
- Motorola DSP56300 family is a widely used, notable exception: it is 24 bit fixed point.
  - Almost the de facto standard for audio processing applications. Why? Think about the dynamic range of the auditory system: your ear has about 120 dB of dynamic range.
    So w/ linear, uniform coding @ 16 bits and 24 bits, the ratio of the amplitude range to the number of quantization levels is:

    10^(120/20)/(2^16) = 15.26
    10^(120/20)/(2^24) = 0.0596

    A ratio above 1 means the word is too short: 16 bits (about 96 dB) cannot span 120 dB, while 24 bits (about 144 dB) can.
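
The arithmetic behind the 16-bit versus 24-bit comparison can be checked with a short Python sketch. It assumes linear, uniform coding, as on the slide. The function names are illustrative.

```python
import math


def dynamic_range_db(bits):
    """Dynamic range in dB of a uniform quantizer with 2**bits levels."""
    return 20 * math.log10(2 ** bits)


def range_to_levels_ratio(range_db, bits):
    """Amplitude range (as a linear ratio) divided by the number of levels."""
    return 10 ** (range_db / 20) / 2 ** bits
```

Each extra bit adds about 6.02 dB, so 16 bits gives roughly 96 dB and 24 bits roughly 144 dB.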

Speed
- Specmanship has inundated all aspects of silicon specification so beware
  - MHz: What is the on-chip clock speed?
  - MIPS: Meg. Instructions Per Second, the reciprocal of the fastest instruction's time divided by 10^6.
  - MMACS: Meg. Multiply-Accumulates per Second.
  - Kernel Times: For specific tasks, 256 point FIR, Radix-2 FFT, what is the absolute time?

Specmanship of Speed

* www.bdti.com, Independent DSP benchmark results for the latest processors

Memory
- Most built around fast bus architecture
  - Harvard architecture uses separate buses and memory spaces for instructions and data (versus von Neumann)
  - Cache to fetch instructions, freeing up the bus to fetch or write data.
- Embedded systems have smaller memory needs
- Variable instruction sizes and memory sizes

Development Tools
- S/W Tools: assemblers, linkers, simulators, debuggers, compilers, code libraries, RTOS
  - DSPs are compiler unfriendly. Unusual and complex instruction sets. C/Ada produce bloated code, intricacies of number crunching almost always coded in Assembler. Floating point processors usually compile cleaner than Fixed
- H/W Tools: emulators, development boards
  - JTAG: IEEE 1149.1, on chip debugging and emulation. Scan based emulation, set breakpoints like a S/W IDE, poll and set registers while paused.

System Management
- Minimizing Vcc to reduce power consumption
- Sleep modes
  - Turn off entire sections of the chip, e.g. the interface for an unconnected protocol
  - Event activation with different latencies, e.g. packet datacom: doesn't decode a packet unless the device address is pinged
- Programmable on-chip clock distribution
  - Clock Dividers for integer differences that arise in digital communication receivers
  - Phase-Locked-Loops (PLLs) for fine control over jitter and frequency

COST!!
- Limiting factor of any REAL design
- Packaging can be 50% of real cost, product plus manufacturing. Many companies are going to BGA (Ball Grid Array) packs versus P/T QFP (Plastic/Thin Quad Flat Pack), making them more expensive and IMPOSSIBLE to rework.

Analog Devices: ADSP-2116x SHARC

- Has special I/O and instructions that accelerate multiprocessor connections
  - 6 processors strung together with bus arbitration
  - Any processor can access the internal memory of any other processor
- Also replicates the entire operational block, giving you two powerful processors and making extensive use of SIMD (more on this later).

Low Range DSPs


- Analog Devices
  - ADSP-210x
- Motorola
  - DSP-560xx
- Texas Instruments
  - TMS320F28x
- ~40 MHz Clock, usually used as a souped up microcontroller.
- Disk drives, cordless phones, ISM band equipment

Mid Range DSPs


- Analog Devices
  - ADSP-218x
- Motorola
  - DSP-563xx
- Texas Instruments
  - TMS320C52x
- 150 MHz, cell-phones, modems.

Very Long Instruction Word

- TI TMS320C62xx: first VLIW DSP
- VLIW processors use simple, orthogonal, RISC based instruction sets. They string several 4, 8 or 16 bit instructions together that use different parts of the H/W to execute every cycle
- Compile cleaner because of simpler instruction sets, but hand-optimization is harder because of heuristic scheduling for the H/W components.

TMS320C62xx
One instruction is fed into two sets of four execution units. Instead of the MAC-ALU serial structure you have them in parallel, meaning each top-down operation is less complex, but may take more instructions

VLIW v Superscalar
- VLIW produces code AT COMPILATION that identifies which instructions are completed in parallel
- Superscalar hardware AT EXECUTION identifies which instructions are completed in parallel
  !! That means that for different iterations through a loop a different order of instructions could be completed. Unusual processing times

Single-Instruction Multiple Data

- Instead of splitting instructions, splits operational blocks. A 16 bit MAC turns into two 8 bit MACs.
- Allows a processor to execute multiple instances of the same operation using different data.
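
The "one 16 bit MAC becomes two 8 bit MACs" idea can be modelled in Python as follows. This is a software sketch with hypothetical names; it treats a 16-bit word as two unsigned 8-bit lanes and keeps one accumulator per lane, which real hardware would update with a single instruction.

```python
def split_lanes(word):
    """Split a 16-bit word into (high, low) unsigned 8-bit lanes."""
    return (word >> 8) & 0xFF, word & 0xFF


def simd_mac(acc, a_word, b_word):
    """One packed MAC: each lane of a_word is multiplied by the same lane of
    b_word and added to that lane's accumulator. `acc` is a (high, low) pair."""
    a_hi, a_lo = split_lanes(a_word)
    b_hi, b_lo = split_lanes(b_word)
    return acc[0] + a_hi * b_hi, acc[1] + a_lo * b_lo
```

For example, `simd_mac((0, 0), 0x0203, 0x0405)` computes 2 x 4 and 3 x 5 in one step: the same operation applied to different data.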

Choose Your Own Adventure


- What DSP code looks like
- DSP Devices that you might be working with
- Short introduction to DSP on video cards
- MMX/SSE overview
- Reading DSP spec sheets

FIR Filters with Assembler


MOT DSP563xx

main()
{
    /* Control logic system setup and whatnot
    ..........................................
    */
    // Begin with an assembler call
    asm
    {
        // Cycle counts listed on the slide, in order: (2) (1) (5) (N) (1) (1)
        move    #AADDR,r0                         // Register r0 load, will contain coeffs
        move    #BADDR,r4                         // Register r4 load, will contain data
        move    #N-1,m4                           // Load loop control
        move    m4,m0                             // move loop control
        movep   y:input,y:(r4)                    // move peripheral data from Input "y"
        clr a   x:(r0)+,x0 y:(r4)-,y0             // clear accumulator, memory moves
        rep #N-1                                  // Repeat next instruction
        mac     x0,y0,a x:(r0)+,x0 y:(r4)-,y0     // Multiply Accumulate, update registers
        macr    c0,y0,a (r4)+                     // Rounding and scaling (set by c0)
        movep   a,y:output                        // move accumulator output to peripheral "y"
    }
    // End assembler call
    /* Control logic system setup and whatnot
    ..........................................
    */
}
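
For readers who do not know the assembler, the same per-sample structure can be sketched in Python: store the new input in a circular delay line, run one MAC per tap, then output the accumulator. This is a floating-point model under stated assumptions, not a translation of the listing. `make_fir` and `step` are hypothetical names, and the rounding and scaling of the final `macr` is omitted.

```python
def make_fir(coeffs):
    """Return a function that filters one sample at a time with the given taps."""
    n = len(coeffs)
    delay = [0.0] * n   # circular delay line, one slot per tap
    pos = 0             # slot where the next input sample is written

    def step(sample):
        nonlocal pos
        delay[pos] = sample              # like the first movep: store new input
        acc = 0.0                        # like clr a: clear the accumulator
        idx = pos
        for c in coeffs:                 # like rep / mac: one MAC per tap
            acc += c * delay[idx]
            idx = (idx - 1) % n          # step back through older samples
        pos = (pos + 1) % n              # modulo update of the write pointer
        return acc                       # like the final movep: output result

    return step
```

Feeding a unit impulse through the filter returns the coefficients in order, which is a quick sanity check for any FIR implementation.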

Differences in Assembler codes

main:
    bits    %fmode, 2       /* Enable Q15 */
    lda     r13, Xdata
    lda     r15, Dbuffer
    lda     r11, Yout
    mov     r10, 40         /* Filter size, Nlen = 40 $$$ */
    mov     r9, 200         /* Input data size (Nsamp = 200) $$$ */
    mov     %cb1_beg, r15
    mov     r8, r10         /* r8 = Nlen */
    add     r8, 1           /* r8 = Nlen+1 */
    add     r10, -1         /* Adjust for loop counter */
    add     r8, r15
    mov     %cb1_end, r8    /* CB size = Nlen+1 */
    bits    %smode, 2       /* Enable CB1 (for r15) */
    mov     r6, 10000
    mov     %timer0, r6     /* Initialize Timer count */
                            /* Worst case cycle count = */
                            /* (Nlen + 6)*Nsamp */

per_sample:
    ldu     r7, r13, 1      /* "Acquire" new sample from "Xdata",*/
                            /* a pre-stored input buffer -- in a */
                            /* real-time application, this new */
                            /* sample may come from a different */
                            /* task or an external device, etc. */
    mov     %loop0, r10
    lda     r14, Hfilter
    psub.a  r0, r0          /* Clear accumulator's 32-bits */
    st      r7, r15         /* Store new sample into Dbuffer */
    mov     %guard, 0       /* Clear Guard bits */
    bits    %tc, 7          /* Timer0 starts ticking */

fir_loop:
    ldu     r4, r14, 1      /* Filter coefficient */
    ldu     r2, r15, 1      /* Sample from Data buffer (circular) */
    mac.a   r2, r4
    agn0    fir_loop
    bitc    %tc, 7          /* Timer0 frozen */
    round.e r0, r0          /* Filter output is rounded */
    stu     r1, r11, 1      /* Filter output is stored */
flag1:
    nop
    add     r9, -1
    bnz     per_sample
    nop
filter_done:                /* Set an SDBUG break-point here */
    nop                     /* Note: ZSIM or RTL need a HALT here */
    nop
    br      filter_done
    nop

This is from the LSI website, and in my mind, one of the reasons why they have lost some market share

Analog Devices Overview


CHIPS
Vendor: Analog Devices

Family | Floating, Fixed, or Both | Data Width | Instruction Width | Core Clock Speed [1] | BDTImark2000 / BDTIsimMark2000 [2] | Total On-Chip Memory, Bytes | Core Voltage | Unit Price [3] | Notes
ADSP-218x | Fixed point | 16 bits | 24 bits | 80 MHz | 240 | 20 K - 256 K | 1.8 | $424 | Many family members w/ assorted peripherals
ADSP-219x | Fixed point | 16 bits | 24 bits | 160 MHz | 410 | 20 K - 160 K | 2.5 | $1024 | Enhanced version of the ADSP-218x
ADSP-2116x (SHARC) | Floating point | 32/40 bits | 48 bits | 100 MHz | 470 | 128 K - 512 K | 1.8, 2.5 | $2299 | Features SIMD, strong multiprocessor support
ADSP-BF53x (Blackfin) | Fixed point | 16 bits | 16/32 bits | 600 MHz | 3360 [5] | 84 K - 308 K | 0.7-1.2, 1.0-1.6 | $635 | Dual-MAC DSP with variable speed and voltage
ADSP-TS20x (TigerSHARC) | Both | 8/16/32/40 bits | 32 bits | 600 MHz | 6150 [5] | 512 K - 3 M | 1.0, 1.2 | $35299 | 4-way VLIW with SIMD capabilities; uses eDRAM

* From http://www.bdti.com

Motorola Devices Overview


CHIPS
Vendor: Motorola

Family | Floating, Fixed, or Both | Data Width | Instruction Width | Core Clock Speed [1] | BDTImark2000 / BDTIsimMark2000 [2] | Total On-Chip Memory, Bytes | Core Voltage | Unit Price [3] | Notes
DSP563xx | Fixed point | 24 bits | 24 bits | 240 MHz | 710 | 24 K - 384 K | 1.5, 1.6, 1.8, 3.3 | $456 | PCI bus, DMA, can run 560xx code unmodified
DSP568xx | Fixed point | 16 bits | 16 bits | 40 MHz [6] | 110 | 28 K - 152 K | 2.5, 3.3 | $315 | Contains many microcontroller-like features
DSP5685x | Fixed point | 16 bits | 16 bits | 120 MHz | 340 | 36 M | 1.8 | $612 | Enhanced version of the 568xx
MSC810x (SC140) | Fixed point | 16 bits | 16 bits | 300 MHz | 3370 [7] | 512 K - 1436 K | 1.6 | $90 195 | Based on quad-MAC SC140 core; 8102 uses 4 cores

* From http://www.bdti.com

TI Devices Overview
CHIPS
Vendor: TI

Family | Floating, Fixed, or Both | Data Width | Instruction Width | Core Clock Speed [1] | BDTImark2000 / BDTIsimMark2000 [2] | Total On-Chip Memory, Bytes | Core Voltage | Unit Price [3] | Notes
TMS320F24x | Fixed point | 16 bits | 16/32 bits | 40 MHz | n/a | 18 K - 1120 K | 3.3, 5.0 | $315 | Hybrid microcontroller/DSP
TMS320F28x | Fixed point | 32 bits | 16/32 bits | 150 MHz | n/a | 164 K - 292 K | 1.8 | $1618 | Hybrid microcontroller/DSP; compatible w/ C24x
TMS320C3x | Floating point | 32 bits | 32 bits | 75 MHz [6] | n/a | 264 K - 2304 K | 3.3, 5.0 | $10213 | Cost-competitive with fixed point DSPs
TMS320C54x | Fixed point | 16 bits | 16 bits | 160 MHz | 500 | 24 K - 1280 K | 1.5, 1.6, 1.8, 2.5, 3.3 | $4109 | Many specialized instructions
TMS320C55x | Fixed point | 16 bits | 8-48 bits | 300 MHz | 1460 | 80 K - 376 K | 1.26, 1.5, 1.6 | $520 | Next generation C5xxx architecture; dual-issue, dual-MAC DSP
TMS320C62x | Fixed point | 16 bits | 32 bits | 300 MHz | 1920 | 72 K - 896 K | 1.5, 1.8 | $9102 | 8-way VLIW
TMS320C64x | Fixed point | 8/16 bits | 32 bits | 720 MHz | 6570 | 288 K - 1056 K | 1.0, 1.2, 1.4 | $39277 | Next generation C6xxx architecture
TMS320C67x | Floating point | 32 bits | 32 bits | 225 MHz | 1100 | 64 K - 264 K | 1.2, 1.26, 1.8, 1.9 | $14110 | Floating point version of C62x

* From http://www.bdti.com

Cores versus Chips


NVidia NV3x Video Card Core NVIDIA GEFORCE FX 5900


[Figure: NV3x core block diagram. Annotations: Cut input into little quads; Interpolator; Programmable DSP Core; Different Units for different processes; Fusing and smoothing]

NV3x Guts


MMX versus SSE


- MMX: 57 new processor instructions, introduced with the Pentium MMX and carried into the Pentium II
  - MMX = MultiMedia eXtensions
  - SIMD for integers
  - MMX instructions operate on 64-bit registers, e.g. two 32-bit integers simultaneously
- SSE: 70 new processor instructions and subtle architecture differences for the Pentium III and later
  - SSE = Streaming SIMD Extensions
  - The Pentium III did not keep up with Moore's law on clock speed alone, but did on most operations because of SSE
  - SIMD for single-precision floating-point numbers
  - SSE instructions operate on four 32-bit floats simultaneously.

SSE Architecture Changes


- New registers, each is 128 bits long and can hold four single-precision (32 bit) floating-point numbers
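
To show what "four floats in one 128-bit register" means, the sketch below models a register as 16 raw bytes using Python's standard `struct` module. The function names are illustrative; `packed_add` only imitates the effect of a packed single-precision add and is not an interface to real SSE hardware.

```python
import struct


def pack_register(floats):
    """Pack exactly four values as single-precision floats into 16 bytes."""
    if len(floats) != 4:
        raise ValueError("a 128-bit register holds exactly four 32-bit floats")
    return struct.pack("<4f", *floats)


def unpack_register(raw):
    """Recover the four single-precision lanes from a 16-byte register image."""
    return list(struct.unpack("<4f", raw))


def packed_add(a, b):
    """Lane-by-lane addition of two register images (software model only)."""
    lanes = [x + y for x, y in zip(unpack_register(a), unpack_register(b))]
    return pack_register(lanes)
```

One packed operation touches all four lanes, which is where the up-to-4x throughput on single-precision code comes from.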

SSE Advantages
- An application cannot execute MMX instructions and perform floating-point operations simultaneously, because the MMX registers are aliased onto the x87 floating-point registers. SSE has its own registers, so it does not have this restriction.
- Operations accelerated with SSE instructions are matrix multiplication, matrix transposition, matrix-matrix operations like addition, subtraction, and multiplication, matrix-vector multiplication, vector normalization, vector dot product, and lighting calculations.

MMX Benchmark

Deependra Talla and Lizy K. John (1999), "Performance Evaluation and Benchmarking of Native Signal Processing," European Conference on Parallel Processing

ADSP-TS20x TigerSHARC
VLIW and SIMD: Split one instruction between two units (VLIW), and each of those units can split their part of the instruction into sub units. In this example we can see one uber-instruction can call 8 16-bit multiplies.

* Walkthrough of ADSP-TS201 Spec Sheet

Motorola DSP56367
- Walkthrough of SPECSHEET

Texas Instruments TMS320VC5421

- Spec Sheet Walkthrough
