Digital Signal Processing: Comp Eng 4TL4
8. Introduction to DSP Architectures
4TL4 DSP
Jeff Bondy and Ian Bruce
DSP Applications
Cell phones
Hard Drives
CD Drives
Modems
Printers
Sonar
Wireless Basestations
Video/Data Transport
Resources
In ONE Cycle
FETCH
  Fetch instruction
DECODE
  Decode instruction
READ
  Calculate address
  Fetch data
    L2 hopefully, or else increase latency by going off chip, update L2 state
    L2 -> L1, update L2 and L1 state
    L1 -> Registers
    Registers -> ALU
EXECUTE
  Compute instruction
  Write result
  Update data pointers
  Update instruction pointer
Used in:
Efficient looping
For repetitive or branching calculations. For-next loops in a general-purpose algorithm kill performance by calculating conditions, checking loop logic, and setting JUMPs.
Delayed branching
Low- to mid-range DSPs have 3- to 5-stage pipelines; delayed branching fills the slots after a branch with useful instructions to get rid of NOPs
Pipelining
None (Motorola 560xx, i.e. old)
[Figure: instruction timing diagrams; each instruction passes through the Fetch, Decode, Read, and Execute stages]
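The payoff of overlapping the four stages can be sketched with a simple cycle count. This is an illustrative Python model, not taken from the slides: it assumes every stage takes exactly one cycle and that there are no stalls or branches, and the function names are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Read", "Execute"]


def cycles_unpipelined(n_instructions, n_stages=len(STAGES)):
    """Each instruction finishes all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=len(STAGES)):
    """Once the pipeline is full, one instruction completes every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def pipeline_chart(n_instructions):
    """One text row per instruction, one column per cycle (F, D, R, E)."""
    rows = []
    for i in range(n_instructions):
        cells = ["."] * i + [stage[0] for stage in STAGES]
        rows.append(" ".join(cells))
    return rows


for row in pipeline_chart(3):
    print(row)
print(cycles_unpipelined(100), "cycles unpipelined vs",
      cycles_pipelined(100), "cycles pipelined")
```

Under these assumptions the pipelined machine approaches one instruction per cycle for long instruction streams; real code loses some of that to branches, which is what delayed branching tries to recover.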
Instruction Sets
Lots of I/O
GPP exceptions
MMX (Pentium)
SSE (Pentium)
SH-2 (Strong Arm)
PowerPC (AltiVec)
UltraSPARC (VIS, Visual Instruction Set)
4x More power
1/3 MIPS
1/3 256-FFT completion time
Same price
4x Die Size
Pentium needs extensive cooling
Modulo Addressing
Modulo addressing: implementing circular buffers and delay lines

Data-shifting
Time     Buffer contents                                              Next sample
n=N      x[N-K+1]  x[N-K+2]  ...  x[N-1]  x[N]                        x[N+1]
n=N+1    x[N-K+2]  x[N-K+3]  ...  x[N]    x[N+1]                      x[N+2]
n=N+2    x[N-K+3]  x[N-K+4]  ...  x[N+1]  x[N+2]                      x[N+3]

Modulo addressing (circular buffer)
Time     Buffer contents                                              Next sample
n=N      x[N-K+1]  x[N-K+2]  x[N-K+3]  ...  x[N-2]  x[N-1]  x[N]      x[N+1]
n=N+1    x[N+1]    x[N-K+2]  x[N-K+3]  ...  x[N-2]  x[N-1]  x[N]      x[N+2]
n=N+2    x[N+1]    x[N+2]    x[N-K+3]  ...  x[N-2]  x[N-1]  x[N]      x[N+3]
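The two buffer update schemes compared on this slide can be sketched in a few lines. This is a minimal Python illustration, not the hardware mechanism itself: a DSP does the index wrap in its address generator at no cycle cost, while here it is an explicit `%`. The names `shift_update` and `CircularBuffer` are hypothetical.

```python
def shift_update(buffer, new_sample):
    """Data shifting: every stored sample moves one slot (K-1 copies per sample)."""
    for i in range(len(buffer) - 1):
        buffer[i] = buffer[i + 1]
    buffer[-1] = new_sample


class CircularBuffer:
    """Modulo addressing: only the oldest slot is overwritten (one write per sample)."""

    def __init__(self, samples):
        self.data = list(samples)
        self.oldest = 0  # index of the oldest sample in the buffer

    def update(self, new_sample):
        self.data[self.oldest] = new_sample
        # Advance the pointer and wrap at the end of the buffer.
        self.oldest = (self.oldest + 1) % len(self.data)

    def in_time_order(self):
        """Return the contents from oldest to newest."""
        k = len(self.data)
        return [self.data[(self.oldest + i) % k] for i in range(k)]


shifted = [1, 2, 3, 4]
shift_update(shifted, 5)

circular = CircularBuffer([1, 2, 3, 4])
circular.update(5)
print(shifted, circular.data, circular.in_time_order())
```

Both schemes hold the same K most recent samples; the circular version just stores them in rotated order, so the filter loop has to read through the same wrapped pointer.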
DSP Characteristics
Arithmetic Format
Bus Width
Speed
Memory/Bus/Instruction architecture
Development Tools
Power Consumption
Cost
Specialized Hardware
Arithmetic
Bus Widths
10^(120/20)/(2^16) ≈ 15.26
10^(120/20)/(2^24) ≈ 0.0596
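The two figures above divide a 120 dB amplitude ratio (10^(120/20) = 10^6) by the number of levels a 16-bit or 24-bit word can represent. A small Python sketch of that arithmetic follows; the function names are hypothetical, and reading the result as "amplitude units per quantization level" is an interpretation rather than something the slide states.

```python
import math


def step_per_level(dynamic_range_db, bits):
    """Amplitude ratio of the signal range divided by the number of code levels."""
    amplitude_ratio = 10 ** (dynamic_range_db / 20)
    return amplitude_ratio / (2 ** bits)


def word_dynamic_range_db(bits):
    """Dynamic range a fixed-point word of this width can span, in dB."""
    return 20 * math.log10(2 ** bits)


for bits in (16, 24):
    print(bits, "bits:",
          round(step_per_level(120, bits), 4), "per level,",
          round(word_dynamic_range_db(bits), 1), "dB word range")
```

The second function gives the familiar rule of thumb of about 6 dB per bit: a 16-bit word falls short of 120 dB, while a 24-bit word covers it with margin.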
Speed
Specmanship of Speed
* www.bdti.com, independent DSP benchmark results for the latest processors
Memory
Development Tools
System Management
COST!!
Analog Devices: ADSP-210x
Motorola: DSP-560xx
Texas Instruments: TMS320F28x

Analog Devices: ADSP-218x
Motorola: DSP-563xx
Texas Instruments: TMS320C52x
TMS320C62xx
One instruction is fed into two sets of four execution units. Instead of the MAC-ALU serial structure, you have them in parallel, meaning each top-down operation is less complex but may take more instructions.
VLIW vs. Superscalar
VLIW: the compiler produces code AT COMPILATION that identifies which instructions are completed in parallel.
Superscalar: the hardware identifies AT EXECUTION which instructions are completed in parallel.
!! With superscalar, that means different iterations through a loop could complete instructions in a different order, giving unpredictable processing times.
Single-Instruction Multiple Data
// Cycle counts from the slide: (2) (1) (5) (N) (1) (1)
move    #AADDR,r0                               // Register r0 load, will contain coeffs
move    #BADDR,r4                               // Register r4 load, will contain data
move    #N-1,m4                                 // Load loop control
move    m4,m0                                   // move loop control
movep   y:input,y:(r4)                          // move peripheral data from Input "y"
clr a   x:(r0)+,x0 y:(r4)-,y0                   // clear accumulator, memory moves
rep #N-1                                        // Repeat next instruction
mac     x0,y0,a x:(r0)+,x0 y:(r4)-,y0           // Multiply Accumulate, update registers
macr    c0,y0,a (r4)+                           // Rounding and scaling (set by c0)
movep   a,y:output                              // move accumulator output to peripheral "y"
}
// End assembler call
/* Control logic system setup and whatnot
..........................................
*/
}
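For comparison, the per-sample work done by the assembler listing above (store the new sample, clear the accumulator, repeat a multiply-accumulate over the taps with wrapped pointers) can be written out in Python. This is a sketch of a direct-form FIR filter with a circular delay line, not a translation of the 56k code: the function names are hypothetical, the arithmetic is floating point rather than fixed point, and the final rounding step is omitted.

```python
def fir_sample(coeffs, delay_line, write_index, new_sample):
    """Process one input sample; returns (output, next_write_index)."""
    n = len(coeffs)
    delay_line[write_index] = new_sample      # new sample overwrites the oldest slot
    acc = 0.0                                 # clear the accumulator
    idx = write_index
    for k in range(n):                        # one multiply-accumulate per tap
        acc += coeffs[k] * delay_line[idx]
        idx = (idx - 1) % n                   # post-decrement with modulo wrap
    return acc, (write_index + 1) % n


def fir_filter(coeffs, samples):
    """Run fir_sample over a whole input sequence, starting from a zeroed delay line."""
    delay_line = [0.0] * len(coeffs)
    write_index = 0
    outputs = []
    for x in samples:
        y, write_index = fir_sample(coeffs, delay_line, write_index, x)
        outputs.append(y)
    return outputs


# An impulse input returns the coefficients themselves.
print(fir_filter([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 0.0]))
```

The inner loop is the part a DSP collapses into a single repeated MAC instruction: the multiply, the add, both memory reads, and both pointer updates all happen in one cycle per tap.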
Differences in Assembler codes
main:
        bits    %fmode, 2           /* Enable Q15 */
        lda     r13, Xdata
        lda     r15, Dbuffer
        lda     r11, Yout
        mov     r10, 40             /* Filter size, Nlen = 40 $$$ */
        mov     r9, 200             /* Input data size (Nsamp = 200) $$$ */
        mov     %cb1_beg, r15
        mov     r8, r10             /* r8 = Nlen */
        add     r8, 1               /* r8 = Nlen+1 */
        add     r10, -1             /* Adjust for loop counter */
        add     r8, r15
        mov     %cb1_end, r8        /* CB size = Nlen+1 */
        bits    %smode, 2           /* Enable CB1 (for r15) */
        mov     r6, 10000
        mov     %timer0, r6         /* Initialize Timer count */
                                    /* Worst case cycle count = */
                                    /* (Nlen + 6)*Nsamp */
per_sample:
        ldu     r7, r13, 1          /* "Acquire" new sample from "Xdata",*/
                                    /* a pre-stored input buffer -- in a */
                                    /* real-time application, this new */
                                    /* sample may come from a different */
                                    /* task or an external device, etc. */
        mov     %loop0, r10
        lda     r14, Hfilter
        psub.a  r0, r0              /* Clear accumulator's 32-bits */
        st      r7, r15             /* Store new sample into Dbuffer */
        mov     %guard, 0           /* Clear Guard bits */
        bits    %tc, 7              /* Timer0 starts ticking */
fir_loop:
        ldu     r4, r14, 1          /* Filter coefficient */
        ldu     r2, r15, 1          /* Sample from Data buffer (circular) */
        mac.a   r2, r4
        agn0    fir_loop
        bitc    %tc, 7              /* Timer0 frozen */
        round.e r0, r0              /* Filter output is rounded */
        stu     r1, r11, 1          /* Filter output is stored */
flag1:
        nop
        add     r9, -1
        bnz     per_sample
        nop
filter_done:                        /* Set an SDBUG break-point here */
        nop                         /* Note: ZSIM or RTL need a HALT here */
        nop
        br      filter_done
        nop
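The listing above enables Q15 mode, accumulates products in a wide accumulator with guard bits, and rounds the result before storing it. A minimal Python sketch of that style of fixed-point arithmetic follows. It is an illustration under stated assumptions, not this processor's exact behaviour: products are kept as Q30 in an unbounded Python integer (standing in for the accumulator plus guard bits), rounding is round-half-up, and the helper names are hypothetical.

```python
Q15_ONE = 1 << 15  # 1.0 in Q15 would be 32768, one past the largest code


def to_q15(x):
    """Convert a float in [-1, 1) to a 16-bit Q15 integer, saturating at the ends."""
    v = int(round(x * Q15_ONE))
    return max(-Q15_ONE, min(Q15_ONE - 1, v))


def q15_mac(acc, a, b):
    """Multiply two Q15 values and add the Q30 product to the accumulator."""
    return acc + a * b


def q30_round_to_q15(acc):
    """Round a Q30 accumulator back to Q15 and saturate to 16 bits."""
    rounded = (acc + (1 << 14)) >> 15
    return max(-Q15_ONE, min(Q15_ONE - 1, rounded))


def fir_q15(coeffs_q15, samples_q15):
    """One FIR output: sum of coefficient-sample products, rounded once at the end."""
    acc = 0
    for h, x in zip(coeffs_q15, samples_q15):
        acc = q15_mac(acc, h, x)
    return q30_round_to_q15(acc)


half = to_q15(0.5)
print(half, fir_q15([half, half], [half, half]))  # 0.25 + 0.25 = 0.5 in Q15
```

Rounding once after the whole sum, rather than after each product, is the reason DSP accumulators are wider than the data path.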
Analog Devices
* From http://www.bdti.com
Family | Floating, Fixed, or Both | Data Width | Instruction Width | Core Clock Speed [1] | BDTImark2000 / BDTIsimMark2000 [2] | Total On-Chip Memory, Bytes | Core Voltage | Unit Price [3] | Notes
ADSP-218x | Fixed point | 16 bits | 24 bits | 80 MHz | 240 | 20 K-256 K | 1.8 | $4-24 | Many family members w/ assorted peripherals
ADSP-219x | Fixed point | 16 bits | 24 bits | 160 MHz | 410 | 20 K-160 K | 2.5 | $10-24 | Enhanced version of the ADSP-218x
ADSP-2116x (SHARC) | Floating point | 32/40 bits | 48 bits | 100 MHz | 470 | 128 K-512 K | 1.8, 2.5 | $22-99 | Features SIMD, strong multiprocessor support
ADSP-BF53x (Blackfin) | Fixed point | 16 bits | 16/32 bits | 600 MHz | 3360 [5] | 84 K-308 K | 0.7-1.2, 1.0-1.6 | $6-35 | Dual-MAC DSP with variable speed and voltage
ADSP-TS20x (TigerSHARC) | Both | 8/16/32/40 bits | 32 bits | 600 MHz | 6150 [5] | 512 K-3 M | 1.0, 1.2 | $35-299 | 4-way VLIW with SIMD capabilities; uses eDRAM
Motorola
* From http://www.bdti.com
Family | Floating, Fixed, or Both | Data Width | Instruction Width | Core Clock Speed [1] | BDTImark2000 / BDTIsimMark2000 [2] | Total On-Chip Memory, Bytes | Core Voltage | Unit Price [3] | Notes
DSP563xx | Fixed point | 24 bits | 24 bits | 240 MHz | 710 | 24 K-384 K | 1.5, 1.6, 1.8, 3.3 | $4-56 | PCI bus, DMA, can run 560xx code unmodified
DSP568xx | Fixed point | 16 bits | 16 bits | 40 MHz [6] | 110 | 28 K-152 K | 2.5, 3.3 | $3-15 | Contains many microcontroller-like features
DSP5685x | Fixed point | 16 bits | 16 bits | 120 MHz | 340 | 36 M | 1.8 | $6-12 | Enhanced version of the 568xx
MSC810x (SC140) | Fixed point | 16 bits | 16 bits | 300 MHz | 3370 [7] | 512 K-1436 K | 1.6 | $90-195 | Based on quad-MAC SC140 core; 8102 uses 4 cores
TI Devices Overview
* From http://www.bdti.com
Vendor: TI
Family | Floating, Fixed, or Both | Data Width | Instruction Width | Core Clock Speed [1] | BDTImark2000 / BDTIsimMark2000 [2] | Total On-Chip Memory, Bytes | Core Voltage | Unit Price [3] | Notes
TMS320F24x | Fixed point | 16 bits | 16/32 bits | 40 MHz | n/a | 18 K-1120 K | 3.3, 5.0 | $3-15 | Hybrid microcontroller/DSP
TMS320F28x | Fixed point | 32 bits | 16/32 bits | 150 MHz | n/a | 164 K-292 K | 1.8 | $16-18 | Hybrid microcontroller/DSP; compatible w/ C24x
TMS320C3x | Floating point | 32 bits | 32 bits | 75 MHz [6] | n/a | 264 K-2304 K | 3.3, 5.0 | $10-213 | Cost-competitive with fixed point DSPs
TMS320C54x | Fixed point | 16 bits | 16 bits | 160 MHz | 500 | 24 K-1280 K | 1.5, 1.6, 1.8, 2.5, 3.3 | $4-109 | Many specialized instructions
TMS320C55x | Fixed point | 16 bits | 8-48 bits | 300 MHz | 1460 | 80 K-376 K | 1.26, 1.5, 1.6 | $5-20 | Next generation C5xxx architecture; dual-issue, dual-MAC DSP
TMS320C62x | Fixed point | 16 bits | 32 bits | 300 MHz | 1920 | 72 K-896 K | 1.5, 1.8 | $9-102 | 8-way VLIW
TMS320C64x | Fixed point | 8/16 bits | 32 bits | 720 MHz | 6570 | 288 K-1056 K | 1.0, 1.2, 1.4 | $39-277 | Next generation C6xxx architecture
TMS320C67x | Floating point | 32 bits | 32 bits | 225 MHz | 1100 | 64 K-264 K | 1.2, 1.26, 1.8, 1.9 | $14-110 | Floating point version of C62x
NV3x Guts
SSE Advantages
MMX Benchmark
ADSP-TS20x TigerSHARC
VLIW and SIMD: Split one instruction between two units (VLIW), and each of those units can split its part of the instruction into sub-units. In this example we can see one uber-instruction can call 8 16-bit multiplies.
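The "two units, each split into sub-units" idea can be modelled in a few lines of Python. This is a conceptual sketch only: the 64-bit packing, the lane order, and the function names are assumptions made for illustration and do not describe the TigerSHARC register layout or instruction encoding.

```python
def to_int16(v):
    """Interpret the low 16 bits of v as a signed 16-bit value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def pack_lanes(lanes):
    """Pack four signed 16-bit values into one 64-bit word (lane 0 in the low bits)."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFFFF) << (16 * i)
    return word


def unpack_lanes(word64):
    """Split a 64-bit word into four signed 16-bit lanes."""
    return [to_int16(word64 >> (16 * i)) for i in range(4)]


def simd_mul16(word_a, word_b):
    """SIMD: one operation multiplies all four 16-bit lanes of one compute unit."""
    return [a * b for a, b in zip(unpack_lanes(word_a), unpack_lanes(word_b))]


def dual_unit_mul16(ax, bx, ay, by):
    """VLIW: one issued instruction drives two units, giving 2 x 4 = 8 multiplies."""
    return simd_mul16(ax, bx) + simd_mul16(ay, by)


a = pack_lanes([1, -2, 3, -4])
b = pack_lanes([5, 6, -7, -8])
print(dual_unit_mul16(a, b, a, b))
```

In hardware the eight products are formed in the same cycle; the Python loop only shows which operands pair up.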
Motorola DSP56367
Walkthrough of SPECSHEET
Texas Instruments
TMS320VC5421