The DSP32C: AT&T’s
Second-Generation Floating-Point
Digital Signal Processor
DSP32 compatibility, a 12.5-MIPS execution rate, and ease of use ensure the success of a programmable chip designed for computation-intensive applications.
The WE DSP32C high-performance, programmable digital signal processor supports 32-bit floating-point arithmetic and is upwardly compatible with its predecessor, the WE DSP32.
(See Figure 1.)
Because it is implemented in 0.75-µm (effective channel length) CMOS
technology, the second-generation device achieves high functional density
with low power consumption.
At a clock rate of 50 MHz, the DSP32C executes 12.5 million instructions per second. This performance implies that it is capable of performing
25 million floating-point operations per second. The device also performs
24-bit integer operations at the rate of 12.5 million operations per second.
These high performance rates permit users to program the DSP32C to implement a wide variety of computation-intensive applications.
Floating-point arithmetic removes a programmer’s problems with number scaling, problems that are endemic to fixed-point-arithmetic signal
processors. Besides floating-point-arithmetic capabilities, the DSP32C
ease-of-use features simplify its insertion into DSP hardware and software
environments. Availability of software and hardware development tools,
including an optimizing C-language compiler, also ensures an ease of use
previously unavailable with programmable digital signal processors.
The following features allow the DSP32C’s insertion into nontraditional
uses of DSP chips such as computer graphics and industrial control:
25-Mflop operation;
16-Mbit/s serial-input and serial-output ports;
a 16-bit, parallel I/O port for control and data transfer;
interrupt facilities;
single-instruction µ-law and A-law data conversions;
single-instruction conversions between integers and floating-point
data;
a byte-addressable, on-chip memory that is extendable off chip;
direct memory access to and from internal and external memory via
parallel and serial I/O ports;
16 Mbytes of address space; and
IEEE Std. 754 floating-point format conversion. 3
While remaining upwardly compatible with the DSP32, the DSP32C offers the several enhancements listed in Table 1. The DSP32C contains
405,000 transistors in an area of 88 square millimeters that is enclosed in a
standard, 133-pin-square PGA (pin grid array) package.
Michael L. Fuccio, Renato N. Gadenz, Craig J. Garen,
Joan M. Huser, Benjamin Ng, and Steven P. Pekarich
AT&T Bell Laboratories
Kreg D. Ulery
AT&T Microelectronics
IEEE Micro, December 1988. 0272-1732/88/1200-0030 © 1988 IEEE
Figure 1. Microphotograph of the DSP32C.
Here, we describe the DSP32C’s instruction set, architecture, and application development tools. The latter include an assembler, a simulator, an optimizing C
compiler, and special-purpose hardware.
Table 1.
DSP32C enhancements.
An overview of the architecture
Figure 2 on the next page shows the block diagram
of the DSP32C. Two execution units, the control
arithmetic unit (CAU) and the data arithmetic unit
(DAU), achieve the high throughput characteristic of
the device. Each unit has its own instruction set. The
CAU performs 16- or 24-bit integer arithmetic for logic
and control functions, while the DAU performs 32-bit
floating-point and data conversion operations. The
CAU can function in an autonomous fashion, performing data transfers, branching control, and integer
arithmetic and logic operations in parallel with the
DAU floating-point operations. In addition, the CAU
generates addresses for the DAU operands.
Besides the program counter, the CAU contains 22
general-purpose registers, which can be used for CAU
instructions. The contents of registers R1 through R14
also function as memory pointers, and the contents of
registers R15 through R19 as pointer increments in
DAU instructions. With the exception of a register
load from memory, execution of a CAU instruction
completes before the next instruction begins execution.
This feature simplifies the use of the CAU for logic and
control operations.
The DAU, on the other hand, employs a four-stage
pipeline to perform 25 million floating-point computations per second. Configured for multiply/accumulate
operations, the DAU is the primary execution unit for
signal processing algorithms. It contains a floating-point multiplier and a floating-point adder that work
in parallel to perform computations of the form a = b + c * d.
The DAU has four 40-bit accumulators. It employs a
straightforward fetch-multiply-accumulate-store
pipeline, which we explain in detail later. Briefly, the
DAU executes a multiply/accumulate instruction in
four stages: fetch of c and d, multiplication of c and d,
accumulation of the c and d product with b (with the
result stored in an accumulator), and an optional write
of the result to memory or an I/O port. A maximum of
two of the three multiply/accumulate operands can
come from memory. The other operand comes from an
accumulator register.
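The multiply/accumulate form a = b + c * d can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not of the chip's timing or number format: the function names `mac` and `dot_product` are hypothetical, and ordinary Python floats stand in for the DSP32C floating-point representation.

```python
def mac(b, c, d):
    """One multiply/accumulate step of the form a = b + c * d."""
    return b + c * d


def dot_product(xs, ys):
    """Sum of products, using one multiply/accumulate per operand pair."""
    acc = 0.0                     # stands in for an accumulator register
    for x, y in zip(xs, ys):      # c and d: the two operands read from memory
        acc = mac(acc, x, y)      # b: the operand that comes from the accumulator
    return acc


if __name__ == "__main__":
    print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # prints 32.0
```

Two operands come from memory and the third from an accumulator, which matches the operand limit described above; a sum of products such as a filter tap loop maps directly onto this pattern.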
[Figure 2 legend]
A0-A3: accumulators 0-3
ALU: arithmetic logic unit
CAU: control arithmetic unit
DAU: data arithmetic unit
DAUC: DAU control register
EMR: error mask register
ESR: error source register
Ibuf: input buffer
IOC: input/output control register
IR: instruction register
IR1-IR4: instruction register pipeline
ISR: input shift register
IVTP: interrupt vector table pointer
Obuf: output buffer
OSR: output shift register
PAR: PIO address register
PARE: PIO address register extended
PC: program counter
PCR: PIO control register
PCW: processor control word
PDR: PIO data register
PDR2: PIO data register 2
Pin: serial DMA input pointer
PIO: parallel I/O unit
PIOP: parallel I/O port register
PIR: PIO interrupt register
Pout: serial DMA output pointer
R1-R19: registers 1 through 19
RAM: read/write memory
ROM: read-only memory
SIO: serial I/O unit
Figure 2. Block diagram of the DSP32C.
Floating-point arithmetic instructions
[Z =] aN = [-]aM {+,-} Y * X
aN = [-]aM {+,-} (Z = Y) * X
[Z =] aN = [-]Y {+,-} aM * X
[Z =] aN = [-]Y * X
aN = [-](Z = Y) * X
[Z =] aN = [-]Y {+,-} X
[Z =] aN = [-]Y
aN = [-](Z = Y) {+,-} X
Data can be written using bit-reversed addressing.

Special functions
[Z =] aN = ic(Y)         µ-law, A-law, 8-bit linear to float
[Z =] aN = oc(Y)         Float to µ-law, A-law, 8-bit linear
[Z =] aN = float(Y)      16-bit integer to float
[Z =] aN = float24(Y)    24-bit integer to float
[Z =] aN = int(Y)        Convert float to 16-bit integer
[Z =] aN = int24(Y)      Float to 24-bit integer
[Z =] aN = round(Y)      Round to 32-bit float
[Z =] aN = ifalt(Y)      If a < 0 then move Y to aN
[Z =] aN = ifaeq(Y)      If a = 0 then move Y to aN
[Z =] aN = ifagt(Y)      If a > 0 then move Y to aN
[Z =] aN = dsp(Y)        IEEE float to DSP32 format
[Z =] aN = ieee(Y)       DSP32 float to IEEE format
[Z =] aN = seed(Y)       Compute approximate reciprocal of Y

Control instructions
nop                                    No operation
goto {N, rH, rH+N}                     Branch
if (COND) goto {N, rH, rH+N}           Conditional branch
if (rM-- >= 0) goto {N, rH, rH+N}      Conditional branch with loop counter
call {N, rH, rH+N} (rM)                Call subroutine
return (rM)                            Return from subroutine
ireturn                                Return from interrupt
do M, {K, rH}                          Do M+1 instructions K (or rH) + 1 times

Integer arithmetic/logic instructions
rD = [-]rS               Assignment/negate
rD = rS {+,-} 1          Increment/decrement
rD = rS1 + rS2           Add registers (triadic)
rD = rS + N              Add register to constant
rD = rS1 - rS2           Subtract registers (triadic)
rD = -rD + N             Subtract register from constant
rD = rS1 & rS2           Logical AND registers (triadic)
rD = rD & N              Logical AND constant with register
rD = rS1 | rS2           Logical OR registers (triadic)
rD = rD | N              Logical OR constant with register
rD = rS1 ^ rS2           Logical XOR registers (triadic)
rD = rD ^ N              Logical XOR constant with register
rD = rS1 # rS2           Reverse carry add registers
rD = rS1 &~ rS2          Logical AND with register complemented
rD = rS / 2              Arithmetic right shift
rD = rS >> 1             Logical right shift
rD = rS >>> 1            Rotate right through carry
rD = rS * 2              Arithmetic left shift
rD = rS <<< 1            Rotate left through carry
rS1 - {N, rS2}           Compare
rS1 & {N, rS2}           Bit test
Both 16- and 24-bit integer operations may be specified.
Instructions that do not use the constant, N, may be conditionally executed as shown in the following example:
if (eq) r1 = r2 + r3

Data move instructions
rD = N
{Z, *N, obuf, pdr, pir, pcw} = rS
rD = {Y, *N, ibuf, pdr, pir, pcw}
Z = {ibuf, pdr, pir, pcw}
{obuf, pdr, pir, pcw} = Y
Figure 3. DSP32C instruction set
Data travels throughout the device via a 32-bit data
bus as seen in Figure 2. This bus supports four memory
accesses during each instruction cycle: an instruction
fetch, two operand reads, and a write to memory. This
high-speed data bus and the pipelined architecture of
the device allow the DSP32C to fetch two 32-bit
operands from memory, perform multiply and accumulate operations, and write a result to an I/O port or
memory during each instruction cycle.
The DSP32C provides on-chip memory and an external memory interface for off-chip memory expansion. Optional memory configurations allow users to
download programs and data into three 512-word
RAM banks on the chip, or to substitute an 8-Kbyte
ROM for the third RAM bank. A 24-bit addressing
capability increases the external memory capacity to 16
Mbytes. Programs can treat memory as a common
resource, with instructions and data arbitrarily residing
in on-chip RAM, on-chip ROM, or external memory.
The external memory interface supports wait states and
bus arbitration.
The parallel I/O, or PIO, port provides a parallel interface for communication between the DSP32C and
external devices. It can be configured as an 8-bit
(DSP32-compatible) or as a 16-bit port. The serial I/O,
or SIO, port provides serial communication and synchronization with devices outside the DSP32C. Three
on-chip DMA controllers support direct, independent
memory access via the serial input, serial output, and
parallel I/O ports. A single-level interrupt facility can
respond to four internal and two external, individually
maskable sources. A relocatable vector table controls
program flow based on the source of the interrupt.
Instruction set
Figure 3 summarizes the instruction set of the
DSP32C. The enhancements with respect to the DSP32
appear in boldface type. The assembly-language syntax
used in the DSP32C instruction set is similar to the C
programming language. This similarity is advantageous when moving algorithms that were developed in
a higher level language such as C or Fortran onto the
DSP32C.
In Figure 3 (and all displays of code in the text), we
use the following notations: [ ] indicates an optional
function, { } indicates a choice of one, and () and uppercase letters indicate that the appropriate value
should be inserted. That is, aN and aM are one of the
four accumulator registers noted as a0 through a3; rD
and rS are from the set of CAU registers r1 to r22. X
and Y may be an accumulator register, the serial input
buffer (ibuf), or a memory location referred to by the
register indirect addressing mode. Z may be the serial
output buffer (obuf), the parallel data register (pdr), or
a memory location. N can be represented as an unsigned
or twos-complement number (limited to 16 bits except
for data-move and control instructions in which it
may be 24 bits). An asterisk before a variable or register indicates the memory location pointed to by that
variable or register. An asterisk operator also indicates multiplication.
The DSP32C instruction set can be partitioned into
two parts according to the two primary execution
units, the DAU and the CAU. The DAU executes instructions for floating-point arithmetic and special
functions. These DAU data-stationary instructions4
execute in a highly pipelined fashion. The CAU executes the remaining instructions, which fall into the
following categories: control, integer arithmetic/logic,
and data move.
DAU special functions. The DAU special-function
instructions convert data formats and conditionally load
accumulators. These single-cycle instructions convert
companded and integer data types to and from floating-point data. In addition, one instruction converts a
single-precision, IEEE Std. 754 floating-point number
to and from the DSP32C internal floating-point
representation. Such conversion capability is useful
when inserting the DSP32C into computing environments that use the IEEE standard for floating-point
arithmetic.
Highlights. Since it is difficult to describe the entire
instruction set in a few pages, we concentrate on some
of its highlights.
Multiply/accumulate instructions. The DAU executes many variations of the multiply/accumulate instructions as can be seen in Figure 3. Here, we limit our
discussion of the multiply and accumulate instructions
to the following format:
[Z =] aN = [-]aM {+,-} Y * X
An example of this instruction is:
*r3++r17 = a0 = a1 + *r1++ * *r2++
This instruction performs the following operations.
The contents of memory locations pointed to by registers R1 and R2 are fetched and multiplied together.
The contents of pointer registers R1 and R2 are then incremented. The product formed by this multiplication
is then added to the contents of accumulator A1. The
accumulated result in A1 is then stored in accumulator
A0 and is also written to the memory location pointed
to by register R3. The contents of pointer register R3
are incremented by the contents of increment register
R17.
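The register and memory effects just described can be traced with a small Python model. This is a sketch under stated assumptions: `step` is a hypothetical helper, memory is a plain dictionary keyed by byte address, the `++` post-increment is assumed to advance a pointer by one 32-bit word (4 bytes), and pipeline latency is ignored so that all effects appear at once.

```python
WORD_BYTES = 4  # assumption: "++" steps a pointer by one 32-bit word


def step(mem, r, a):
    """Model the effects of  *r3++r17 = a0 = a1 + *r1++ * *r2++  (no pipeline)."""
    x = mem[r["r1"]]              # fetch the operand pointed to by r1
    y = mem[r["r2"]]              # fetch the operand pointed to by r2
    r["r1"] += WORD_BYTES         # post-increment both operand pointers
    r["r2"] += WORD_BYTES
    a["a0"] = a["a1"] + x * y     # add the product to a1, result goes to a0
    mem[r["r3"]] = a["a0"]        # also write the result through pointer r3
    r["r3"] += r["r17"]           # post-modify r3 by increment register r17


if __name__ == "__main__":
    mem = {0x00: 2.0, 0x10: 3.0, 0x20: 0.0}
    r = {"r1": 0x00, "r2": 0x10, "r3": 0x20, "r17": 8}
    a = {"a0": 0.0, "a1": 1.5}
    step(mem, r, a)
    print(a["a0"], mem[0x20], r)
```

Keeping the pointer updates separate from the arithmetic mirrors the division of labor in the device, where the CAU generates and updates addresses while the DAU computes.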
Although DAU instructions employ a fetch-multiply-accumulate-store pipeline, the pipeline timing is such
that the DSP32C achieves a throughput of one instruction per instruction cycle. As mentioned previously, a
variety of these DAU instructions can be used efficiently
for many signal processing algorithms.
Among the DAU special functions, the instruction aN = seed(Y) lets users approximate the reciprocal of a number Y as
the seed value for the Newton-Raphson iteration used
for a divide operation. This instruction eliminates approximately 20 percent of the cycles needed to compute
a division. The division algorithm that incorporates this
instruction provides a result that is precise to 22 bits in
1.5 microseconds.
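A seeded Newton-Raphson division can be sketched as follows. The `seed` function here is only a stand-in: it uses a textbook linear estimate of the reciprocal, not the approximation implemented by the seed(Y) instruction, and the iteration count of three is an assumption chosen for this particular estimate. The names `seed`, `reciprocal`, and `divide` are illustrative.

```python
import math


def seed(y):
    """Crude reciprocal estimate (stand-in for the seed(Y) instruction)."""
    if y == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    m, e = math.frexp(abs(y))                # abs(y) = m * 2**e, 0.5 <= m < 1
    est = 48.0 / 17.0 - (32.0 / 17.0) * m    # linear fit to 1/m on [0.5, 1)
    return math.copysign(math.ldexp(est, -e), y)


def reciprocal(y, iterations=3):
    """Refine the seed with Newton-Raphson steps x <- x * (2 - y * x)."""
    x = seed(y)
    for _ in range(iterations):
        x = x * (2.0 - y * x)                # relative error is squared each step
    return x


def divide(n, d):
    """Division as multiplication by an iterated reciprocal."""
    return n * reciprocal(d)


if __name__ == "__main__":
    print(divide(1.0, 3.0), divide(10.0, -4.0))
```

Because each iteration squares the relative error, a better starting value removes iterations outright, which is why a hardware seed shortens the division.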
CAU control instructions. The CAU executes control
instructions that alter the program flow. These control
instructions include branches, conditional branches,
subroutine calls and returns, branch-on-counter-value
instructions, and low-overhead looping.
Conditional branching can be accomplished under
DAU, CAU, or I/O conditions related to the status of
serial and parallel I/O buffers. Whether the branch is
taken or not, the instruction following the branch instruction will always be executed, since it will already
have been fetched.
The low-overhead looping construct
do M {K,rH}
can implement a loop with a specified length without
incurring the overhead of a test-and-branch step. The
only overhead is the one instruction cycle required to execute
the Do instruction. This loop instruction executes the
next M + 1 instructions K(or rH) + 1 times, where rH
is the contents of a CAU register. By permitting the
loop counter operand to be placed into a register, the
DSP32C Do-loop construct allows the application program flexibility in computing the number of times a
loop is to be executed. The DSP32C does not restrict
the type of instructions that may be inserted into the
loop, except that the Do instruction cannot be nested.
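The counting convention of the Do instruction (M + 1 instructions, K + 1 times) is easy to get wrong by one, so a small model may help. In this sketch `run_do` is a hypothetical helper and instructions are represented by zero-argument Python callables; nothing here reflects the actual instruction encoding.

```python
def run_do(program, pc, M, K):
    """Model `do M, K` located at index pc of `program`.

    Executes the M + 1 instructions that follow the do, K + 1 times,
    and returns the index of the first instruction after the loop body.
    """
    body = program[pc + 1 : pc + 2 + M]      # the next M + 1 instructions
    for _ in range(K + 1):                   # K + 1 passes through the body
        for instruction in body:
            instruction()
    return pc + 2 + M


if __name__ == "__main__":
    trace = []
    program = [
        None,                                # placeholder for the do itself
        lambda: trace.append("a"),
        lambda: trace.append("b"),
        lambda: trace.append("c"),           # first instruction after the loop
    ]
    next_pc = run_do(program, 0, M=1, K=2)
    print(trace, next_pc)
```

With M = 1 and K = 2 the two-instruction body runs three times and execution resumes at the third instruction after the do.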
Three-operand CAU instructions. The CAU has a
RISC-like (reduced instruction-set computing) instruction set. In fact the CAU has a load/store architecture.
Having loaded operands into its general-purpose registers, the CAU can perform arithmetic and logic operations common to most microprocessors, at an execution rate of 12.5 MIPS. The CAU supports three-operand (triadic) instructions for arithmetic and logic.
An example is:
rD = rS1 - rS2.
In this instruction, the contents of two registers are
subtracted, and the result is stored in a third (destination) register.
Arithmetic/logic instructions that do not have an
immediate operand can be executed conditionally. An
example is:
if (eq) r3 = r1 + r2
Here, the contents of registers R1 and R2 are added
together, and the result is stored in R3 only if the result
of the previous CAU instruction is zero (eq). These
conditional instructions save the overhead of a test-and-branch sequence common in microprocessors.
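A conditionally executed add can be modeled as a single guarded operation. This is a sketch, not a description of the hardware: `add_if_eq` is a hypothetical helper, registers and flags are plain dictionaries, only the zero flag is modeled, and the assumption that the flag is updated only when the instruction executes is ours.

```python
MASK24 = 0xFFFFFF  # CAU registers are 24 bits wide


def add_if_eq(regs, flags, d, s1, s2):
    """Model `if (eq) rD = rS1 + rS2`: runs only when the zero flag is set."""
    if flags["z"]:                                   # eq: previous result was zero
        result = (regs[s1] + regs[s2]) & MASK24      # 24-bit wraparound add
        regs[d] = result
        flags["z"] = result == 0                     # assumed: flag updated only on execution
    return regs, flags


if __name__ == "__main__":
    regs = {"r1": 5, "r2": 7, "r3": 0}
    flags = {"z": True}
    add_if_eq(regs, flags, "r3", "r1", "r2")
    print(regs["r3"], flags["z"])
```

The guard replaces a separate compare, a branch around the add, and the add itself, which is the saving the text refers to.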
CAU move instructions. To load and store the CAU
general-purpose registers, the CAU uses a set of data-move instructions (see Figure 3). For example, the instruction
r3 = *r14++
indicates that the CAU register R3 is loaded with the
contents of the memory location pointed to by the
CAU register R14, and the contents of the latter are
then incremented.
These data-move instructions also handle the registers in the parallel and serial I/O units, so that direct
transfers to and from the I/O units and memory can be
performed. The following example shows a store of the
contents of the serial input buffer (Ibuf) into a memory
location pointed to by CAU register R9, with postincrementation of the contents of the CAU register:
*r9++ = ibuf
Internal architecture
We now describe the DSP32C architecture a little
more in depth.
Processor cycle. The DSP32C has a processor cycle
time of 80 nanoseconds, at a clock frequency of 50
MHz. Each processor cycle is divided into four states,
numbered 0 through 3; each state equals one period of
the clock input, or 20 ns (50-MHz clock frequency).
Each DAU multiply/accumulate instruction requires
up to four memory accesses: memory read (I instruction), memory read (X operand), memory read (Y
operand), and memory write (Z operand).
In one cycle the DSP32C can perform an instruction
fetch, two operand fetches, and a write to memory or
the I/O. For a given instruction, since the DAU is pipelined, these four accesses do not occur in the same processor cycle. (See Figure 4.)
Figure 5 illustrates a “full pipe” of multiple, sequential DAU multiply/accumulate instructions. The subscripts refer to instruction numbers; for example, X0 is
the X operand for instruction number 0. The DAU
fetches and executes instructions in ascending order.
Data arithmetic unit. As stated earlier, the DAU is
the primary execution unit for signal processing algorithms. This unit contains a 32-bit floating-point multiplier, a 40-bit floating-point adder, four 40-bit accumulators, and a DAU control register, the DAUC. The
multiplier and adder work simultaneously to process
computations at the rate of 12.5 MIPS, each of the
form (a = b + c * d). Figure 6 contains a block
diagram of the DAU.
The DAU transmits data to and receives data from
other sections of the chip via the internal data bus. The
DAU is divided into a data path and a control unit. The
data path consists of a floating-point multiplier,
floating-point adder, registers, buses, and bus connectors. The DAU multiplier and adder operate in parallel, each requiring one processor cycle for execution.
The four accumulator registers may be read or written
via program control. The DAU control unit decodes an
instruction into signals that control the data path section. The DAU data path and control unit each contain
a four-stage (fetch-multiply-accumulate-store) pipeline. Thus, in one processor cycle, the DAU may be
processing four different instructions, each in a different stage of execution.
Figure 4. Internal data bus (pipeline) for a single DAU instruction.
Figure 5. Internal data bus (full pipeline) for multiple DAU instructions.
Figure 6. Block diagram of the DAU. Legend: A0-A3, accumulators 0-3; DAUC, DAU control register; DU1, DU2, adder input operand delay registers; IR, instruction register; IR1-IR4, DAU instruction register (processor cycles 1-4); P, product register; S, special-function register; X, Y, multiplier input registers.
The DAU contains both 32-bit and 40-bit registers
and buses to support two floating-point formats. In
addition to the standard number of bits in the mantissa, the DAU maintains 8 mantissa guard bits for accumulate operations. Figure 7 shows the two DAU
floating-point formats.
When a 40-bit bus connects to a 32-bit register or
multiplexer, the guard bits are excluded (truncated).
For example, when writing a 40-bit accumulated result
to memory, the result is first truncated to 32 bits. The
40-bit accumulated result may also be rounded to 32
bits using a DAU instruction. When a 32-bit register
drives a 40-bit bus, the guard bits (bits 15 through 8)
are set to zero on the 40-bit bus.
Figure 7. DAU floating-point formats: 32 bits (a) and 40 bits (b). Fields shown: mantissa, guard bits, and exponent.
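The movement between 40-bit and 32-bit words amounts to removing or inserting bits 15 through 8. The sketch below treats words as Python integers with bit 0 as the least significant bit; the function names are hypothetical, and the code makes no claim about what the remaining fields mean beyond the guard-bit positions given in the text.

```python
def truncate_40_to_32(w40):
    """Drop guard bits 15..8 of a 40-bit word, keeping the other 32 bits."""
    upper = (w40 >> 16) & 0xFFFFFF    # bits 39..16 of the 40-bit word
    lower = w40 & 0xFF                # bits 7..0 of the 40-bit word
    return (upper << 8) | lower


def extend_32_to_40(w32):
    """Insert eight zero guard bits at positions 15..8."""
    upper = (w32 >> 8) & 0xFFFFFF     # bits 31..8 of the 32-bit word
    lower = w32 & 0xFF                # bits 7..0 of the 32-bit word
    return (upper << 16) | lower


if __name__ == "__main__":
    print(hex(truncate_40_to_32(0x123456AB78)))   # prints 0x12345678
    print(hex(extend_32_to_40(0x12345678)))       # prints 0x1234560078
```

Truncation discards the guard byte (0xAB in the example) without rounding, which is why the text mentions a separate rounding instruction for cases where that matters.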
Each DAU multiply/accumulate instruction involves three floating-point operands. Two of these
operands are multiplied together, and the result is added
to the third. One of the three operands comes from an
accumulator. The result of the addition is stored in an
accumulator and optionally written to memory or an
I/O port. The value in this accumulator may be an intermediate result used in later multiply/accumulate
operations, or it may be a final result to be stored in
memory or sent to the I/O units.
The DAU multiplier inputs can originate from
memory, the SIO port, or an accumulator. Multiplier
inputs are 32-bit floating-point numbers with a 24-bit
mantissa and an 8-bit exponent. If the contents of an
accumulator are input to the multiplier, the guard bits
are first truncated. The multiplier always provides one
input to the adder. The other adder input can originate
from memory, the SIO port, or an accumulator. This
input is either a 32- or 40-bit operand, depending on its
source. The four accumulators, A0 through A3, provide
the 40-bit operands. Although the multiply/accumulate
structure is rigid, the DSP32C has flexibility in the way
operands are loaded into the three inputs. This flexibility ensures that the DAU will be suitable for a wide
variety of digital signal processing algorithms.
The A0 through A3 accumulators eliminate rounding-off problems to ensure 24-bit precision. Postnormalization logic transparently shifts binary points and
adjusts exponents to prevent inaccurate rounding of
bits when the floating-point numbers are added or multiplied, thus eliminating concerns like scaling and
quantization error. Each adder result, which is then
stored in one of the accumulators, is fully normalized.
All normalization occurs automatically.
Single-instruction data conversions take place in the
DAU to free application programs from the overhead
required for these conversions. The DAU converts
data between the DSP32C internal floating-point format and IEEE Std. 754 32-bit floating-point; 16-bit integer; 24-bit integer; 8-bit µ-law and A-law,6 and 8-bit
linear formats. The DAU also provides an instruction
to convert a 32-bit floating-point operand to a 32-bit
seed value used for reciprocal approximation in division operations.
Data instruction flow. The DAU employs a straightforward fetch-multiply-accumulate-store pipeline.
Because all floating-point postnormalization is automatic, it does not require additional pipeline stages.
The DSP32C takes advantage of this pipeline when
multiply/accumulate instructions execute one after the
other. Before considering consecutive multiply/
accumulate operations, let’s consider the flow of one
instruction as it passes through the DAU pipeline.
The DAU supports four multiply/accumulate instruction formats, thus providing flexibility in choosing operands and storing results. To simplify this
discussion, we describe just one format. The DSP32C
executes the multiply/accumulate instruction
[Z =] aN = aM + Y * X,
in four stages as follows:
X and Y fetch,
multiply X * Y,
accumulate the product with accumulator AM and
store the result in accumulator AN, and
optionally store the result in the location specified
by Z.
If several multiply/accumulate instructions execute
one after the other, the DSP32C automatically
pipelines the instructions so that one instruction completes in every processor cycle. We show this process in
the following block of DSP32 instructions:
1) [Z1 =] aN = aM + Y1 * X1,
2) [Z2 =] aN = aM + Y2 * X2,
3) [Z3 =] aN = aM + Y3 * X3,
4) [Z4 =] aN = aM + Y4 * X4, and
5) [Z5 =] aN = aM + Y5 * X5.
Again, we subscript X1-5, Y1-5, and Z1-5 to make it
easier to follow the data flow in the DAU. Figure 8
displays the instruction execution.
The programmer must only be aware of 1) when an
accumulate operation is finished, 2) whether the result
accumulated in AN is intended to be used as an input to
the multiplier, or 3) whether the contents of the location specified by Z are intended to be used as an input
to the multiplier or the adder. For example, the contents of AN accumulated in instruction 1 will be available as a multiplier input in instruction 4, but can be
used as an adder input in instruction 2. The contents of
the memory location or I/O port specified by Z in instruction 1 will be available as an input to the multiplier
and/or the adder in instruction 5.

Cycle   XY fetch     Multiply     Accumulate     Write
1)      XY fetch1
2)      XY fetch2    multiply1
3)      XY fetch3    multiply2    accumulate1
4)      XY fetch4    multiply3    accumulate2    write1
5)      XY fetch5    multiply4    accumulate3    write2
6)                   multiply5    accumulate4    write3
7)                                accumulate5    write4
8)                                               write5
Figure 8. Execution of DAU instructions.
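The availability rules for instructions 2, 4, and 5 can be derived from the four-stage schedule with a short model. This is a sketch under a simplifying assumption of ours: a value produced in one cycle can be read by any stage that runs in a later cycle, multiplier and memory operands are read in the fetch stage, and the accumulator adder input is read in the accumulate stage. The names `schedule` and `first_user` are hypothetical.

```python
STAGES = ("fetch", "multiply", "accumulate", "write")
OFFSET = {stage: i for i, stage in enumerate(STAGES)}


def schedule(n_instructions):
    """Return {cycle: [(stage, instruction), ...]} for back-to-back DAU instructions."""
    table = {}
    for i in range(1, n_instructions + 1):
        for stage in STAGES:
            table.setdefault(i + OFFSET[stage], []).append((stage, i))
    return table


def first_user(producer, produced_in, read_in):
    """First instruction whose `read_in` stage runs after the producer's `produced_in` stage."""
    ready_cycle = producer + OFFSET[produced_in]     # cycle in which the value is produced
    return ready_cycle + 1 - OFFSET[read_in]         # instruction reading one cycle later


if __name__ == "__main__":
    for cycle, work in sorted(schedule(5).items()):
        print(cycle, work)
    print(first_user(1, "accumulate", "accumulate"))  # adder input: prints 2
    print(first_user(1, "accumulate", "fetch"))       # multiplier input: prints 4
    print(first_user(1, "write", "fetch"))            # value written via Z: prints 5
```

Five instructions occupy eight cycles in this model, and the three results agree with the instruction numbers quoted in the text.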
The fact that the contents of an accumulator, multiplier input, or adder input can be written to memory or
to an I/O port without requiring an additional instruction is a key feature when performing such tasks as
windowing, adaptive filtering, and matrix operations.
While the four-stage pipeline in the DAU contributes to the high throughput of the DSP32C, it makes
logic and control arithmetic problematic in the DAU.
The CAU performs these functions. CAU instructions
operate on 16- and 24-bit integers and resemble common RISC-style microprocessor instructions. The
CAU is less pipelined than the DAU, and its integer
arithmetic requires one instruction cycle. The CAU can
perform its own operations while the DAU is in various
stages of its pipeline. This arrangement allows the
DAU and CAU instructions to work together.
Internal timing. The internal pipeline timing of the
DAU for a single instruction of the form Z = aN =
aM + Y * X is shown in Figure 9. The instruction requires five processor cycles (I0 through I4) to be
decoded and executed in the DAU. The instruction appears on the data bus in state 3 of the cycle preceding I0. The X and
Y registers are loaded in states 1 and 2 of I1, respectively. The contents of these registers are multiplied during
I2, and the product is loaded into the product register P
in state 3 of I2. The adder input register S is also loaded
in state 3 of I2. The contents of the P and S registers are
added during I3, and the result is loaded into an accumulator in state 3 of I3. This same result is then placed
on the data bus (which is routed to a destination in
memory, the PIO port, or the SIO port) in state 0 of I5
if a Z-field is specified in the instruction. The Z-field is
optional.
Control arithmetic unit. The CAU seen in Figure 10
performs address calculations, branching control, and
16- or 24-bit integer arithmetic and logic operations. It
consists of a 24-bit ALU that performs the integer
arithmetic and logical operations, a 24-bit program
counter register, and twenty-two 24-bit general-purpose registers.
The CAU has two modes of operation: one executes
CAU instructions, and the other generates addresses
for the operands of DAU instructions. CAU instructions perform data movement, branching control, and
16- and 24-bit integer or logical operations. Since DAU
instructions can have up to four memory accesses per
instruction, the CAU generates addresses for these
locations. It uses the postmodified, register indirect
addressing mode, and it generates one address in each
of the four states of an instruction cycle.
Figure 9. DAU internal pipeline timing for a single instruction. For instructions in the form of Z = aN = aM + Y * X, where X, Y = memory, I/O registers, or a0-a3; aM = a0-a3; aN = a0-a3; and Z = memory or I/O registers.
Figure 10. Block diagram of the CAU. Legend: IR, instruction register; PC, program counter; R1-R14, CAU general-purpose registers, DAU pointer registers; R15-R19, CAU general-purpose registers, DAU increment registers; Pin (R20), CAU general-purpose register, serial DMA input register; Pout (R21), CAU general-purpose register, serial DMA output register; IVTP (R22), CAU general-purpose register, interrupt vector table pointer; M, Do-loop counter register (number of instructions in loop); N, Do-loop counter register, immediate value (number of iterations of loop).
Registers R1 through R14 function as general-purpose registers for CAU instructions and as memory
pointers (RP) for DAU instructions. When used for
memory pointers, registers R1 through R14 hold 24-bit
addresses. Registers R15 through R19 act as general-purpose registers for CAU instructions and as increment registers (RI) for DAU instructions. When used
as increment registers, registers R15 through R19 hold
24-bit values that can postmodify addresses in the
memory pointers. Register R20, called Pin (pointer in),
acts as the SIO DMA input pointer, and R21, called
Pout (pointer out), as the SIO DMA output pointer.
Register R22, also called IVTP (interrupt vector table
pointer), holds the base address of the interrupt vector
table.
The CAU data-move instructions specify data transfers between CAU registers and memory, CAU registers and I/O registers, and I/O registers and memory,
based on 8-, 16-, 24-, and 32-bit operands. All CAU
arithmetic operations are performed on 24-bit, twos-complement, integer operands. If an instruction affects flags, the flags are calculated based on flag rules
for 16-bit or 24-bit operations, depending on the size of
the operation as specified in the instruction.
For example, when performing operations using 16-bit integer arithmetic, 1) 16-bit data is loaded into the
lower 16 bits of the 24-bit CAU register(s), with the
most significant bit of the integer extended into the
upper 8 bits; 2) 24-bit arithmetic is performed, with
flags computed according to 16-bit operation flag
rules; and 3) the results are stored in memory by
writing the lower 16 bits of the register onto a 16-bit
memory location.
Many of the CAU arithmetic instructions can be performed using three different operands: two source registers (RS1, RS2) and one destination register (RD).
CAU instructions can also be executed conditionally,
based on flags generated in the CAU. Figure 11 illustrates the CAU operations for executing two instructions of the form:
I0: rD - rS
I1: if (eq) rD = rS1 + rS2
Figure 11. CAU internal pipeline timing for a CAU instruction. Legend: rD = destination register; rS, rS1, rS2 = source registers; eq = CAU condition equal to zero, based on the CAU z (zero) flag; PC = program counter; + = addition operation; - = subtraction operation; ? = modification of registers and flags is conditional.
As noted earlier, with the exception of a register load from memory,
a CAU instruction completes before the next instruction begins execution. This process simplifies use of the
CAU for logic and control operations.
While DAU instructions execute, the CAU generates
up to four addresses, one address in each of the four
states of an instruction cycle. In each state, the CAU
can add the contents of two registers: a pointer selected
from registers R1 through R14 (placed on the pointer
bus), and an increment selected from R15 through R19
(placed on the increment bus). The CAU employs a
three-stage pipeline to 1) fetch from a register(s), 2)
operate on the fetched operand(s), and 3) store the
result in a register. Because of this pipelining, the
preceding pointer is updated (result placed on the update bus), while the next pointer and increment are being accessed. Figure 12 shows the CAU operations for
executing a DAU instruction of the form:
Z = aN = aM + Y * X
Note that the result of the multiply/accumulate
operation is not available on the data bus until state 0
of instruction cycle I5 (see discussion of DAU pipeline)
and the destination address is calculated during state 1
of instruction cycle I2. Thus, the address is latched and
held for three instruction cycles before being placed on
the address bus. The Z-address delay block shown in
Figure 10 performs this three-instruction cycle delay.
Memory. The DSP32C provides on-chip memory
(RAM only or a RAM/ROM combination) and an external memory interface for off-chip memory expansion. Instructions and data can arbitrarily reside in onchip RAM, on-chip ROM, or external memory. The
addresses of the various blocks of memory can be configured in eight different memory modes.
Internally, the DSP32C contains 4,096 bytes of
RAM that are available in the two memory configurations offered. Also, the chip provides either 8,192 bytes
of mask-programmable ROM or an additional 2,048
bytes of RAM. Thus, two on-chip memory configurations are possible: 4,096 bytes of RAM and 8,192 bytes
of ROM, or 6,144 bytes of RAM.The on-chip RAM is
static and does not need to be refreshed. Users can
mask-program the ROM with an application program(s)
and/or fixed data. In addition, they can secure the contents of ROM, to protect proprietary code from examination, by wire-bonding the corresponding pad
inside the package.
The external memory interface directly addresses up
to 16 Mbytes of memory, with no loss of execution
speed. This interface also supports wait states to accommodate access to slower speed memory and 1 / 0
peripherals, and bus arbitration to accommodate the
need for sharing the external memory interface signals
with other devices. The two-section external memory is
divided into a low partition A and a high partition B.
Users can independently configure the number of wait
states for each partition. Therefore, a mix of slow and
fast memory can be used to provide the necessary
throughput at a reasonable cost.
zyxwvutsrqp
zyxw
rD - rS
if (eq) rD = rS1
+ rS2
The first instruction subtracts the contents of the RS
register from the contents of the RD register to set or
clear the CAU z (zero) flag. The second instruction
checks the z flag, and if it is set (that is, the result of the
previous instruction was zero), adds the contents of
RS1 and RS2 and stores the result in RD. If the flag is
not set, the contents of RD are unaffected by this instruction. Except for the register load from memory
that has a latency of one instruction cycle, execution of
40
IEEEMICRO
zyxwvutsrqponm
zyxwvutsrqponmlkjihgfedcbaZ
zyxwvutsrqp
zyxwvutsr
For instructionsin the form of
I, * r Pz + + r Iz
=aN =aM + * r P e + r I y + ' r Px + + r Ix
where r P z , r Py , r Px = Pointer register for DAU,
2 , Y, and X operands
r Iz , r I y,r I = Increment register for DAU,
Z. Y, and X operands
Figure 12. CAU internal pipeline timing for a DAU instruction.
PC = Program counter
+ = Addition operand
A = Address
zyxwvutsrqpo
Each wait state takes one quarter of an instruction cycle (20 ns). The on-chip address, data latches, and memory control signals allow a zero-chip interface to a standard byte-wide memory chip. All instructions are 32 bits wide, fetched in one memory access. Four data types (8-, 16-, 24-, and 32-bit) exist, and the memory is uniformly byte addressable, with 32-bit data accessed at the same speed as 8-bit data.
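The timing figures quoted above follow from the clock rate. The short Python sketch below works through the arithmetic; the function names and the per-partition dictionary are hypothetical, and the sketch models only the delay that wait states add, not the full bus timing.

```python
CLOCK_MHZ = 50.0            # clock rate quoted for the DSP32C
STATES_PER_INSTRUCTION = 4  # four states make up one instruction cycle


def instruction_cycle_ns(clock_mhz=CLOCK_MHZ):
    """One instruction cycle spans four clock periods."""
    return STATES_PER_INSTRUCTION * 1000.0 / clock_mhz


def wait_state_ns(clock_mhz=CLOCK_MHZ):
    """Each wait state lasts one quarter of an instruction cycle."""
    return instruction_cycle_ns(clock_mhz) / 4


def added_delay_ns(wait_states, clock_mhz=CLOCK_MHZ):
    """Extra time that the configured wait states add to one external access."""
    return wait_states * wait_state_ns(clock_mhz)


# Hypothetical configuration: fast memory in partition A, slow memory in B.
partition_wait_states = {"A": 0, "B": 3}
delays = {name: added_delay_ns(n) for name, n in partition_wait_states.items()}
```

At 50 MHz this gives an 80-ns instruction cycle and 20-ns wait states, so the slow partition in this example adds 60 ns to each access while the fast partition adds nothing.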
Data bus. The DSP32C employs a von Neumann architecture with one address bus and one data bus. Data
travels throughout the DSP32C via its 32-bit internal
data bus. This data bus supports four data transfers
during each processor cycle: the instruction fetch, two
memory operand reads, and a memory write. The internal data bus accesses all sections of the chip as well
as the external memory interface. The latter interfaces
to a 32-bit external data bus.
Parallel I/O unit. The PIO unit consists of a register file, control unit, and a bidirectional data bus. It allows the DSP32C to communicate with external devices. An external microprocessor easily controls the DSP32C via the parallel port. For example, the PIO unit can initiate DMA to or from the DSP32C, reset the device, halt it, or monitor various conditions internal to it. The external parallel I/O data bus PDB can be either 8 or 16 bits wide.
The PIO contains three 16-bit data registers (PDR, PDR2, and PIR), a 24-bit address register (PAR/PARE), a 16-bit processor control word (PCW), an 8-bit I/O port register (PIOP), a 10-bit control register (PCR), a 16-bit error mask register (EMR), and an 8-bit error source register (ESR). These registers control PIO transfers and configure error control and interrupt features.
Users can configure the PIO unit as an 8- or 16-bit parallel data bus interface. When configured as an 8-bit interface, the PIO supports two additional 4-bit PIO ports, which can be individually configured as inputs or outputs and read or written by the DSP32C. When a port is configured as an input, the DSP32C can read the 4 bits of data present at the pins and those bits can be used for data or control.
The PIO monitors six maskable, internal error conditions. It can be configured to interrupt an external device and, if desired, halt the DSP32C when an unmasked error condition occurs.
PIO data transfers can be made under program, interrupt, or DMA control. Using PIO DMA, an external device can download a program and upload the results of an operation without interrupting the DSP32C program in progress. PIO buffer-empty and PIO buffer-full conditions allow the DSP32C program or interrupt-driven I/O to read and/or write to the PIO port.
Serial I/O unit. The SIO unit provides serial communication and synchronization with external devices. External control pins on the DSP32C allow a direct interface to a time-division-multiplexed (TDM) line, a zero-chip interface to many codecs (coder/decoders), and direct DSP32C-to-DSP32C transfers for multiprocessor applications. The SIO performs serial-to-parallel conversion of input data and parallel-to-serial conversion of output data, at a maximum rate of 16 Mbits/s. This unit is composed of a serial input port, a serial output port, and an on-chip clock generator. A serial input/output control IOC register controls the SIO input and output formats.
The serial input and serial output ports are doubly buffered, making back-to-back transfers possible by allowing a second transfer to begin before the first has been completed. The transfer lengths of the serial input and output words can be configured as 8, 16, 24, or 32 bits, and are selected independently of each other. The input and output data can also be independently selected to process either the most significant bit or the least significant bit first.
SIO transfers can be made under program, DMA, or interrupt-driven control. In DMA mode, transfers occur between Ibuf and memory or between memory and Obuf without program intervention. The serial input buffer-full and serial output buffer-empty flags allow the DSP32C program or interrupt-driven I/O to read and/or write to the SIO port.
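The effect of the word-length and bit-order options can be sketched in a few lines of Python. This is a behavioral illustration only; the function names and the list-of-bits representation are assumptions of the example and say nothing about how the IOC register encodes these choices.

```python
VALID_LENGTHS = (8, 16, 24, 32)  # transfer lengths named in the text


def serial_to_parallel(bits, length=16, msb_first=True):
    """Assemble `length` serial bits (0/1 values, in arrival order) into a word."""
    if length not in VALID_LENGTHS:
        raise ValueError("length must be 8, 16, 24, or 32")
    if len(bits) != length:
        raise ValueError("expected exactly %d bits" % length)
    word = 0
    for position, bit in enumerate(bits):
        if msb_first:
            word = (word << 1) | bit   # first bit ends up most significant
        else:
            word |= bit << position    # first bit ends up least significant
    return word


def parallel_to_serial(word, length=16, msb_first=True):
    """Split a word into the sequence of bits that would be shifted out."""
    if length not in VALID_LENGTHS:
        raise ValueError("length must be 8, 16, 24, or 32")
    order = range(length - 1, -1, -1) if msb_first else range(length)
    return [(word >> position) & 1 for position in order]


msb_stream = parallel_to_serial(0xA1, length=8, msb_first=True)
lsb_stream = parallel_to_serial(0xA1, length=8, msb_first=False)
```

Because input and output formats are selected independently, a word can be received in one bit order and retransmitted in the other simply by using different `msb_first` settings on the two sides.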
Interrupts
The DSP32C provides a single-level interrupt facility and responds to six interrupt sources, four internal and two external. The interrupts are prioritized and are individually maskable. A relocatable vector table controls program flow based on the source of the interrupt. The following list presents the interrupt sources in order of descending priority:
1) External interrupt one (INTREQ1);
2) Parallel I/O buffer full (PDF) generated when the PDR register is loaded;
3) Parallel I/O buffer empty (PDE) generated when the PDR register is read;
4) Serial input buffer full (IBF) generated when the serial input buffer is loaded;
5) Serial output buffer empty (OBE) generated when the serial output buffer is emptied; and
6) External interrupt two (INTREQ2).
In response to a given interrupt, the DSP32C branches to the corresponding address in the interrupt vector table (Figure 13). This table contains eight pairs of 32-bit words starting at the location specified in the IVTP register R22. Before interrupts are enabled, the IVTP register should contain the base address of the interrupt vector table.
Figure 13. Interrupt vector table. (Entries are 32 bits wide and appear in the following order of increasing address: external interrupt 1, PIO buffer full, PIO buffer empty, SIO input buffer full, SIO output buffer empty, external interrupt 2, reserved, reserved.)
Before servicing an interrupt, the DSP32C will automatically save the state of the machine that is invisible to the programmer, the DAU accumulators A0 through A3 (including guard bits), and the DAUC register. The internal state that is visible to the programmer must be saved and restored by the interrupt service routine. To return to the interrupted program, the interrupt service routine must restore the user-visible state of the DSP32C (that was saved) and then execute the ireturn instruction. This step restores the state of the machine that is not visible to the user. The low interrupt overhead results from the use of only two instruction cycles to enter an interrupt service routine and one instruction cycle to return to the interrupted program.
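The priority ordering and the table layout lend themselves to a brief sketch. The Python below assumes that the table entries sit at consecutive 8-byte offsets from the IVTP base in the same order as the priority list, which is how the figure appears to be arranged; the function names and the set-based representation of pending and masked sources are hypothetical.

```python
# Interrupt sources in descending priority, as listed in the text.
INTERRUPT_SOURCES = ["INTREQ1", "PDF", "PDE", "IBF", "OBE", "INTREQ2"]
ENTRY_BYTES = 8  # each table entry is a pair of 32-bit words


def vector_address(ivtp, source):
    """Address branched to for `source`, assuming entries follow priority order."""
    return ivtp + ENTRY_BYTES * INTERRUPT_SOURCES.index(source)


def highest_priority(pending, masked=()):
    """Pick the pending, unmasked source with the highest priority, if any."""
    for source in INTERRUPT_SOURCES:
        if source in pending and source not in masked:
            return source
    return None


ivtp = 0x001000  # hypothetical base address loaded into R22 (IVTP)
winner = highest_priority({"OBE", "PDE"})
target = vector_address(ivtp, winner)
```

With both PDE and OBE pending, PDE wins and the branch target is the third entry of the table; masking PDE would let OBE through instead.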
Development tools
Users can access both software and hardware tools
to aid in the development of application programs for
the DSP32C.
Software. Software tools used to create, test, and debug DSP32C application programs at the assembly-language level are packaged in the WEDSP32C-SL Support Software Library. The library includes an assembler, a link editor, a simulator, and other utilities, all of which run under the Unix or MS-DOS operating systems. (Work is in progress to port these tools to the Macintosh and VMS operating systems.)
The assembler translates the user’s assembly-language program into the binary code used by the
DSP32C. A notable feature of the DSP32C assembly
language is its use of a high-level, C-like syntax. The
accompanying box displays two programming examples.
The assembler generates relocatable code that the
link editor can easily alter. The relocatable code can
reside anywhere in the addressable space and can be
combined with code that was assembled separately.
The simulator program emulates the operations of the DSP32C program in a nonreal-time environment. For full program debugging, the simulator allows access to all registers and memories. It also provides an interface to the DSP32C hardware development system described later. The simulator’s capabilities include single-stepping, breakpointing, and execution profiling. Simulator users can freeze processing on many conditions, such as the execution of a specified instruction, access of a specified register or memory location, or occurrence of a specified number of I/O events. A file can supply input data, while another file can capture output data. Users can refer to memory locations by their symbolic names rather than their absolute addresses. In addition, users can define complex command sequences and display formats, and invoke them with one command.
Another very powerful software tool is the optimizing C-language compiler. The DSP32C compiler allows users to write application programs in the general, high-level C language. Thus, in applications in which users develop preliminary programs in C, the source code can be moved to the DSP32C with a minimal amount of time and effort.
The optimizing portion of the C compiler performs generic optimizations aimed at taking advantage of the DSP32C instruction set and resources. In addition, the optimizer can analyze program data flow to satisfy data dependencies introduced by some pipeline latencies in the DSP32C. Pipeline optimization reorders instructions where possible or uses NOP (no operation) instructions to flush the pipeline. Some data dependencies cannot be resolved from a C program. Extra user control of the optimization process is provided at the C source level for such cases. Hartung et al. (reference 9) offer further information about the C compiler.
Application library. The DSP32C product line includes a set of common routines written in the DSP32C assembly language. These routines compose the WEDSP32C-AL Applications Library. This library contains many commonly used signal processing functions such as finite impulse response (FIR) filters, infinite impulse response (IIR) filters, adaptive filters (LMS algorithm), and fast Fourier transforms (FFTs). In addition, the library provides commonly used functions such as sin(x), cos(x), ln(x), and log(x). As of this writing, the library contains over 60 subroutines, which also execute on the DSP32 (references 7 and 8). The DSP32C C compiler can call these Application Library subroutines.
C compiler. The DSP32C C compiler is a complete
implementation of the C language. It supports all integer and single-precision floating-point data types.
The compiler is an implementation of the Unix portable C compiler, which guarantees portability of C
programs from Unix System V machines to the
DSP32C.
In addition to handling generic operations, like +, -, *, /, %, &, |, and so on, the DSP32C C compiler takes full advantage of the DSP32C multiply/accumulate instructions. For example, complex operations, often found in signal processing, such as a = b + (c * d) are compiled into one instruction.
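To see why a single-instruction a = b + (c * d) matters for signal processing, consider how a filter output is built up. The Python sketch below is only an illustration of the arithmetic pattern; the helper names are hypothetical, and it makes no claim about the code the DSP32C compiler actually emits.

```python
def mac(accumulator, y, x):
    """One multiply/accumulate step: the a = b + (c * d) pattern."""
    return accumulator + y * x


def fir_output(coefficients, samples):
    """One FIR filter output computed as a chain of multiply/accumulate steps."""
    accumulator = 0.0
    for coefficient, sample in zip(coefficients, samples):
        accumulator = mac(accumulator, coefficient, sample)
    return accumulator


y = fir_output([0.5, 0.25], [2.0, 4.0])
```

Each pass through the loop body corresponds to one multiply/accumulate operation, so a filter with N taps needs on the order of N such operations per output sample.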
Hardware. The WEDSP32C-DS development system supports application system hardware development and real-time evaluation of DSP32C programs. The DSP32C-DS development system, a PC-based family of five boards, can be used to 1) develop DSP32C programs and test them in real time and 2) perform in-circuit emulation of user-target hardware. The modular design of the development system allows multiple configurations to achieve various development environments. The DSP32C simulator, running on an
AT&T PC6300 (or compatible), provides a user inter-
/* 3x3 x 3x1 matrix multiply for DSP32C */
#define matA r1         /* address of A[0,0] */
#define matB r2         /* address of B[0,0] */
#define matC r3         /* address of C[0,0] */
#define dec r15         /* dec is used to re-init matB ptr */
mat3x1: matA = A
        matB = B
        matC = C
        dec = -8
        do 2,2
        a0 = *matB++ * *matA++
        a0 = a0 + *matB++ * *matA++
        *matC++ = a0 = a0 + *matB++dec * *matA++
end:    nop
.rsect ".ram1"          /* Assembler directive. */
A:      float 1.0,2.0,3.0       /* Data for matrix A. */
        float 5.0,6.0,7.0
        float 9.0,10.0,11.0
B:      float 1.0,2.0,3.0       /* Test data for matrix B. */
C:      3*float 0.0             /* Stores result of C here */
Figure A. A simple, code-efficient DSP32C programming example: (3 x 3) x (3 x 1) matrix multiply.
The first four #define statements are simply preprocessor commands that instruct the preprocessor to replace a specific variable with a DSP32C register each time the variable is encountered in the program. These optional statements help make the resulting code more readable, since we chose the variable names to be meaningful (matA is the variable that points to elements in matrix A, and so on).
Each line is a single-word instruction. The four instructions beginning at the label mat3x1 initialize the pointer variables (registers) used in the matrix multiply. A value of -8 decrements the pointer to matrix B by two 32-bit locations (recall that the address space is byte addressable, so 8 bytes equal two 32-bit locations).
Once the pointers are initialized, only four instructions compute the matrix multiplication. The first instruction (do 2,2) is a low-overhead looping instruction. Its syntax is “DO the next N + 1 instructions M + 1 times,” which, in this case, translates to execute the following block of three instructions three times. Branching to re-execute the block of three instructions incurs no overhead.
Each of the three instructions in the loop comes from the DAU multiply/accumulate group. Borrowing from the C programming language, we use an asterisk before a variable or register to indicate the memory location pointed to by that variable or register. The asterisk operator also indicates multiplication. If ++ follows a variable or register, it indicates postincrementation by one memory location (a 32-bit location for multiply/accumulate instructions). In the third multiply/accumulate instruction, the variable matB (in register R2) is followed by ++dec, which means postincrementation by the contents of the variable dec (previously defined to be in register R15). In this case, dec has been initialized to -8, which re-initializes the variable matB to point to the first element of matrix B.
Each pass through the loop multiplies a row of the square matrix A by the column matrix B and writes the summed result as a new entry to the column matrix C. Note that the write to matrix C is performed in the last multiply/accumulate instruction and avoids the necessity of a separate data move instruction.
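The pointer movements in this example can be traced with a small behavioral model. The Python sketch below mirrors the loop instruction by instruction using a dictionary as byte-addressable memory; the base addresses and helper names are hypothetical, and the model deliberately ignores pipeline latency and the 32-bit floating-point format.

```python
def load_floats(memory, base, values):
    """Store values at consecutive 4-byte (32-bit) locations starting at `base`."""
    for index, value in enumerate(values):
        memory[base + 4 * index] = float(value)


def mat3x1(memory, a_addr, b_addr, c_addr):
    """Behavioral model of the (3 x 3) x (3 x 1) loop in Figure A."""
    mat_a, mat_b, mat_c = a_addr, b_addr, c_addr  # matA = A, matB = B, matC = C
    dec = -8                                      # dec = -8
    for _ in range(3):                            # do 2,2: three instructions, three times
        a0 = memory[mat_b] * memory[mat_a]        # a0 = *matB++ * *matA++
        mat_b += 4
        mat_a += 4
        a0 = a0 + memory[mat_b] * memory[mat_a]   # a0 = a0 + *matB++ * *matA++
        mat_b += 4
        mat_a += 4
        a0 = a0 + memory[mat_b] * memory[mat_a]   # ... + *matB++dec * *matA++
        mat_b += dec                              # back to the first element of B
        mat_a += 4                                # on to the next row of A
        memory[mat_c] = a0                        # *matC++ = a0
        mat_c += 4


memory = {}
A_ADDR, B_ADDR, C_ADDR = 0x000, 0x100, 0x200      # hypothetical base addresses
load_floats(memory, A_ADDR, [1, 2, 3, 5, 6, 7, 9, 10, 11])
load_floats(memory, B_ADDR, [1, 2, 3])
load_floats(memory, C_ADDR, [0, 0, 0])
mat3x1(memory, A_ADDR, B_ADDR, C_ADDR)
result = [memory[C_ADDR + 4 * i] for i in range(3)]
```

With the test data from Figure A, `result` holds the three row-by-column products. Note that matA simply keeps advancing through all nine elements of A, while matB is pulled back by 8 bytes at the end of every pass.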
Example 2
This second example shows the assembly code for an in-place, 4,096-point, complex, radix-2, decimation-in-time FFT routine.
call -fft (r14), nop
/* COMPLETE DSP32C assembly code listing for FFT */
4096, 12, Real, Imag
/* Arguments for 4096 p t complex FFT */
ur a2
ui a3
/* Generalized FFT subroutine for in-place complex radix 2 DIT FFT */
Jft
rlOe==*rl4++
/* Get arguments - N */
r7e-*r14++
/* l W N ) */
r2e-*r14++
/* Pointer to Real */
r17e-*r14-/* Pointer to Imag */
r9e=2
r7e-r7-2
bitrev r 17e-r 17-r2
/* IN-PLACE BIT REVERSAL */
r19e---r17
r 16e-r10*2
r18e-rlG3
rle-r2
rle-r2
A
i f (ge) pcgoto B, r5e-rl
*rl++r17
a0 = *r2++rl7
*rl++r19
a0 *r2++r19
*r2++r17
a0 *r5++r17
*r2++r19
a0 *r5++r19
B
rle-r1+4
if ( r l g - >- 0) pcgoto A, r2e-r2#r16
r8e-1
/* FFT CALCULATION */
fftc
r4e-W
/* Initialization */
rl3e-one
r6e-two
r10e-r10/2
C
rge-r9*2
r 15ePr9*2
r15e-rl5-rl7
/* DFT Stage initialization */
r2e-514
ur-*rl3++
ui-*rl%r16e-r&2
r8e-r8*2
rle-r2
D
rlle-rl
r3e-rl+r9
/* Twiddle calculation initialization */
r12e-r3
rl8e-rl0-1
butfly do 5, r18
/* BUTTERFLY (6 instruction cycles) */
a0 *rl++r17 + ur *r3++r17
/: T
R(i)+Ur*R(k) */
*rll++r17-a0--a0*r3++r19 *ui
/ R(i)-T- T-UiII(k) */
a1 *rl++rl9 + ui**r3++r17
/* T
I(i)+Ui R(k) */
*r12++r17-a0--a0+*rl++rl7* *r6
/* R(k)- -T+2*R(i)
*/
*r 11+ +r 15=ai-ai+ *r3++r 15*ur
/* I i)==T- T+Ur*I(k) */
*rl2++rl5-al--al+*rl++r15**r6
/*
-T+2*1(i)
*/
twid
a0 ur *r4++
/* Compute next twiddle */
ur a0 ui * *r4
/* u = u * w
*I
ur * *r4ui a0 + ui * *r4
i f r16- >- 0) pcgoto D, r2e=r2+4
i f 117- >= 0) pcgoto C, r4e=r4+8
E
goto r14+4, nop
two
float 2 0
/* Constant, Twiddle and Data tables * /
one
float 1 0, 0 0
w
float -1 0000000,OOOooOOO,O0000000,-1 0000000,O 7071068,-0 7071068
float 0 9238795,-0 3826834,O 9807853,-0 1950903.0 9951847,-0 0980171
float 0 9987955,-4 9067674e-2,O 9996988,-2 4541228e-2,O 9999247,-1 2271538e-2
float 0 9999812,-6 1358846e-3,O 9999953,-0 0030676,O 9999988,-0 00153398
/* Data for f(n)-sin(nr/2)+sin(nr/4+r/4) * /
=ox004000
512*float 0 707,2 ,O 707,-1 ,-0 707,O ,-0707,-1 /* Real data in true order */
Real
Imag
4096'float 0 0
/* imaginary data */
main
int24
#define
#define
face. The simulator allows the user to download programs to the device on the development system and to execute them there. The user can set breakpoints on instruction fetches, allowing the memory, registers, and accumulators to be examined and modified. The PC communicates with the DSP32C device directly through its parallel I/O port.
The DSP32C Development Card lets the user develop software algorithms in a PC environment. This AT&T PC6300 (or compatible) plug-in card contains a DSP32C device and a 16K x 32-bit, high-speed static RAM that can be upgraded to a 64K x 32-bit static RAM. The DSP32C serial I/O port interfaces to an on-board codec that is socketed to accept either an AT&T T7520A high-precision codec with filters or a pin-compatible device. Two mini-coaxial connectors on the development card allow users to supply analog I/O to the codec. Users can bypass the on-board codec and interface directly to the DSP32C SIO port through a 34-pin connector on the card. The upper byte of the DSP32C PIO port is also available to users as I/O bits via a 16-bit connector on the card.
The DSP32C Development Card can be used independently or with a half-height daughter board, which increases the available memory by 1M x 32-bit data. This DRAM Extended Memory Card interfaces the DSP32C on the development card to four banks of 256K x 32-bit dynamic RAM. When both cards are used together, they occupy one and a half PC slots.
The DSP32C In-Circuit Emulator Card emulates target system hardware in real time. This card contains a DSP32C device; a 133-pin, PGA-adapter socket, which plugs into the target hardware; and buffers to allow the user to access the DSP32C PIO port during breakpoint operation. A 44-pin ribbon cable connects the card to the PC via a half-height PC plug-in card called the DSP32C PC Bus Card.
The PC Bus Card provides the parallel communication between the PC and the DSP32C on the in-circuit emulator. Used in conjunction with the PC Bus Card, the emulator card allows the user to run breakpointed code entirely from the target system hardware. The PC Bus Card can also be used with the DSP32C Multi-In-Circuit Emulator Card, which allows the user to multiplex up to four emulator cards to a single PC slot. The stand-alone DSP32C multiemulator interface also interfaces to the PC over a 44-pin ribbon cable connected to the PC Bus Card. When using this interface, the user can select any single emulator card for communication with the host PC. The software operates in a polled mode to control this operation.
Performance benchmarks, applications
Table 2 lists a suite of signal processing benchmarks set for the DSP32C. The FFT benchmark does not include bit-reversed ordering of the output. Note that an in-place, 1,024-point, complex FFT with ordering of outputs takes 3.2 ms.
Table 2. DSP32C signal processing benchmarks.
Benchmark                                  Time
FIR filter                                 80 ns/tap
LMS adaptive filter                        160 ns/tap
5 multiply second-order section (IIR)      400 ns
Lattice filter                             160 ns/section
Divide                                     1.5 µs
(3 x 3)(3 x 1) matrix multiply             720 ns
256-point window                           20 µs
1,024-point, complex FFT                   2.9 ms
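Several of these figures can be read as whole numbers of 80-ns instruction cycles (12.5 million instructions per second at 50 MHz). The Python sketch below does that conversion and estimates FIR filter timing from the per-tap figure; the function names are hypothetical, and the sample-rate estimate assumes the filter is the only work done per sample.

```python
CYCLE_NS = 4 * 1000.0 / 50.0  # four states per instruction at 50 MHz = 80 ns


def instruction_cycles(time_ns):
    """Express a benchmark time as a number of 80-ns instruction cycles."""
    return time_ns / CYCLE_NS


def fir_time_us(taps, ns_per_tap=80.0):
    """Time for one FIR output, using the per-tap figure from Table 2."""
    return taps * ns_per_tap / 1000.0


def max_sample_rate_khz(taps, ns_per_tap=80.0):
    """Upper bound on sample rate if the FIR filter is the only per-sample work."""
    return 1.0e6 / (taps * ns_per_tap)


matrix_cycles = instruction_cycles(720.0)   # (3 x 3)(3 x 1) matrix multiply
fft_cycles = instruction_cycles(2.9e6)      # 1,024-point complex FFT, 2.9 ms
fir64_us = fir_time_us(64)
fir64_khz = max_sample_rate_khz(64)
```

The 720-ns matrix multiply works out to nine instruction cycles, which is consistent with the nine multiply/accumulate operations in the (3 x 3) x (3 x 1) example, and a 64-tap FIR filter takes 5.12 µs per output sample.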
The availability of the C compiler means that high-level programming language benchmarks are meaningful for the DSP32C. We have measured the single-precision Whetstone benchmark and the Dhrystone benchmark and found that the DSP32C compares very favorably with state-of-the-art microprocessors. The following benchmark results were obtained from a C-code implementation of the benchmarks compiled with the DSP32C C compiler:
6.46 million Whetstones per second, and
14,336 Dhrystones per second.
As can be seen from the features we've described,
the DSP32C is suitable for use in many different application areas such as telecommunications, speech
processing, image processing, graphics, array processors, robotics, studio electronics, instrumentation,
and military applications. Table 3 lists some of the applications.
The DSP32C upwardly compatible extension to the DSP32 architecture excels in many signal processing applications (references 10-13) as well as in nontraditional DSP chip applications. The enhancements in the DSP32C widen its spectrum of potential applications. We have described the architecture and instruction set of the DSP32C and its support products including the C compiler, which allows an easy implementation of many applications.
Table 3. DSP32C applications.
Application                  Example
Electronic data processing
  Mass memory                Disk controllers, high-precision servo control
  Workstations               Graphics, translations, rotations, shading, perspective scaling, inversion, multiplication, numeric accelerators, array processing
  Front-end processor        Bit-manipulation, encryption
Industrial
  Robotics                   High-precision servo control
  Image processing           Restoration, pattern recognition, compression
  Process control            Minicomputer functions
  Real-time simulators       Graphics, servo control, system modeling
  Instrumentation            Oscilloscopes, FFT, spectrum analysis, signal generators
Telecommunications
  PBX                        Tone detection, tone generation, MF, DTMF
  Switches                   Tone detection, tone generation, line testing
  Modem                      Echo cancellation, filtering, error correction and detection
  Transmission               Multipulse LPC, ADPCM, transmultiplexing, encryption
Government/military
  Sonar                      Beam forming, FFT
  ECM                        FFT, adaptive filtering
  Airframe                   Simulation
  Radar tracking             Precision FFT, matrix inversions
Speech
  Recognition                Feature extraction, spectrum analysis, pattern matching
  Synthesis                  LPC, formant synthesis
  Coding                     ADPCM, LPC, multipulse LPC, vector quantization
Consumer
  Studio electronics, entertainment, educational    Digital audio, high-end video (special effects)
Acknowledgments
We gratefully acknowledge the contributions of C.N. Tanga, N. Agrawal, and J.E. Beck for their work on the design, support, and verification of the DSP32C; and our design partners in R.N. Kershaw’s group in the Digital IC Design Department for their work on the circuit design and layout of the device. We also thank M. Murphy and A.A. Pignone for their design of the development system; J. DeHart for developing several of the DSP32C software tools; and the DSP Systems Development Group under J. Hartung for developing the C compiler, the Applications Library, and several applications. J.R. Boddie and R.A. Pedersen managed the DSP32C project.
References
1. J.R. Boddie et al., “A Floating Point DSP with Optimizing C Compiler,” Proc. IEEE Int’l Conf. Acoustics, Speech and Signal Processing, Vol. 88CH2561-9, Apr. 7-11, 1988, pp. 2009-2012.
2. W.P. Hayes et al., “A 32-Bit VLSI Digital Signal Processor,” IEEE J. Solid State Circuits, Vol. SC20, No. 5, Oct. 1985, pp. 998-1004.
3. ANSI/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic, IEEE Comp. Soc., Los Alamitos, Calif., 1985.
4. P.M. Kogge, “The Microprogramming of Pipelined Processors,” Proc. Fourth Ann. Conf. Computer Architecture, Mar. 1977, pp. 63-69.
5. W.H. Press et al., Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, 1986, p. 254.
6. A. Kundig, “Digital Filtering in PCM Telephone Systems,” IEEE Trans. Audio and Electroacoustics, Vol. AU-18, No. 4, Dec. 1970, pp. 412-417.
7. J.R. Boddie et al., “The Architecture, Instruction Set and Development Support for the WEDSP32 Digital Signal Processor,” Proc. IEEE ICASSP, Vol. 86CH2243-4, Apr. 7-11, 1986, pp. 421-424.
8. WEDSP32 Support Software Library, AT&T, Allentown, Pa., July 1986.
9. J. Hartung et al., “A Practical C Language Compiler/Optimizer for Real Time Implementations on a Family of Floating Point DSPs,” Proc. IEEE ICASSP, Vol. 88CH2561-9, Apr. 7-11, 1988, pp. 1674-1677.
10. R.V. Cox et al., “Implementation and Application of an Embedded Subband Coder,” Proc. Int’l Conf. Communications, Vol. 88CH2538-7, June 12-15, 1988, pp. 90-95.
11. J. Tow et al., “Implementation of DSP Applications Using the AT&T DSP32C C Compiler and Applications Library,” Proc. Int’l Symp. Circuits and Systems, Vol. 88CH2458-8, June 7-9, 1988, pp. 1061-1065.
12. R.V. Cox et al., “New Directions in Subband Coding,” IEEE J. Selected Areas in Commun., Special Issue on Voice Coding for Commun., Vol. 6, No. 2, Feb. 1988, pp. 391-409.
13. J. Tow, “Implementation of Digital Filters with the WEDSP32 Digital Signal Processor,” Application Note, AT&T Microelectronics, Allentown, Pa., Apr. 1988.
December 1988
47
Michael L. Fuccio was lead engineer for the
DSP32C project in the AT&T DSP Design
Group. He is presently on temporary assignment at the City College of New York
where he is teaching computer architecture
courses. He has been involved in the architecture of 32-bit microprocessor chips
and peripherals, and design and testing of
the WEDSP16 DSP. His interests include
the architecture of DSPs, adaptive signal
processing, multiprocessing, and engineering education.
Fuccio received the BEEE degree from the State University
of New York at Stony Brook and the MSEE degree from
Georgia Institute of Technology. He is presently pursuing
further graduate study at Rutgers University in New Jersey.
He is a member of the IEEE Computer Society.
Benjamin Ng is a member of the technical
staff in the DSP Design Group. He was a
member of the architecture team for the
WE32100 chip set and an architect for the
WE32201 MMU/data cache. Then he
worked on the architecture and design of a
graphics processor. His areas of interest include computer architecture and computer
graphics.
Ng received his BSEE from Cornell
University, New York, and his MSEE from Columbia
University also in New York. He is a member of the IEEE
Computer Society.
Renato N. Gadenz is a distinguished
member of the technical staff working in
the DSP Design Group. He was a member
of the team that developed the first Bell
Labs programmable DSP and has been
connected since then with the design,
simulation, and/or testing of each member
of the AT&T DSP family. Previously, he
had designed active filters, developed software for testing microprocessors and microprocessor systems, and worked on mixed analog and
digital ICs. Before joining Bell Labs, he was a professor and
chair of the Departamento de Electricidad at the Universidad
de Chile and a lecturer and associate research engineer at the
University of California at Los Angeles.
Gadenz received his Ingeniero Civil Electricista degree
from the Universidad de Chile in Santiago, his MSEE degree
from the University of Pittsburgh, and his PhD degree from
the University of California, Los Angeles. He is a member of
the IEEE.
Craig J. Garen, as supervisor of the DSP
Design Group at AT&T Bell Laboratories,
most recently supervised the design of the
DSP32C. He has contributed to the design
of three general-purpose programmable
DSPs, the DSP20, DSP16, and DSP32.
Garen received his BSEE from Lehigh
University in Pennsylvania and his MSEE
from California’s Stanford University. He
is a member of Tau Beta Pi and Eta Kappa
Nu.
Steven P. Pekarich currently works as a
member of the technical staff in the DSP
Design Group. He was responsible for the
architecture and design of the CAU module
in the DSP32C. Current interests include
CPU and DSP architectures and the design
of CAD tools for architectural studies.
Pekarich received his BSEE from Monmouth College and his MS in computer
science from Stevens Institute of Technology, both in New Jersey. He is a member of Eta Kappa Nu,
Sigma Pi Sigma, and ACM, and an associate member of the
IEEE Computer Society.
Kreg D. Ulery is a member of the DSP Product Management Group in AT&T Microelectronics. He has worked in product engineering, application engineering, and technical support for the DSP product family.
Prior to this assignment, he worked at Bell
Laboratories where he participated in digital and analog design projects in the DSP
Design Department.
Ulery received a BSc degree from Pennsylvania’s Messiah College and an MSEE degree from
Rutgers University in New Jersey.
Questions concerning this article can be directed to Renato
N. Gadenz, AT&T Bell Laboratories, Rm. 2D-235, Holmdel,
NJ 07733.
Joan M. Huser is a member of the technical
staff working in the DSP Design Group.
She has been involved in a variety of digital
signal processing areas, including VLSI
design, hardware development system
design, applications programming, and
technical support.
Huser received BS degrees in electrical
engineering and computer science from
Washington University in St. Louis,
Missouri, and an MS degree in electrical engineering from
Stanford University.
48
IEEEMICRO
Reader Interest Survey
Indicate your interest in this article by circling the
appropriate number on the Reader Interest Card.
Low
156
Medium 157
High 158