The DSP32C: AT&T's Second-Generation Floating-Point Digital Signal Processor

M.L. Fuccio; R.N. Gadenz; C.J. Garen; J.M. Huser; B. Ng; S.P. Pekarich; K.D. Ulery

DSP32 compatibility, a 12.5-MIPS execution rate, and ease of use ensure the success of a programmable chip designed for computation-intensive applications.

The WE DSP32C high-performance, programmable digital signal processor supports 32-bit floating-point arithmetic and is upwardly compatible with its predecessor, the WE DSP32. (See Figure 1.) Because it is implemented in 0.75-μm (effective channel length) CMOS technology, the second-generation device achieves high functional density with low power consumption. At a clock rate of 50 MHz, the DSP32C executes 12.5 million instructions per second. Because a single instruction can perform both a floating-point multiplication and a floating-point addition, the device is capable of 25 million floating-point operations per second. The device also performs 24-bit integer operations at the rate of 12.5 million operations per second. These performance rates permit users to program the DSP32C to implement a wide variety of computation-intensive applications.

Floating-point arithmetic removes the number-scaling problems that are endemic to fixed-point signal processors. Besides its floating-point capabilities, the DSP32C offers ease-of-use features that simplify its insertion into DSP hardware and software environments. The availability of software and hardware development tools, including an optimizing C-language compiler, also provides an ease of use previously unavailable with programmable digital signal processors.
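The performance figures quoted above follow from simple arithmetic, sketched below. The sketch assumes two things taken from later sections of the article: each instruction cycle spans four clock periods, and one multiply/accumulate instruction counts as two floating-point operations. The constant and function names are illustrative only.

```python
# Minimal sketch of the throughput arithmetic; all names are illustrative.
CLOCK_HZ = 50e6              # clock rate quoted in the article
STATES_PER_CYCLE = 4         # assumption from the text: four clock periods per instruction cycle
FLOPS_PER_INSTRUCTION = 2    # assumption: one multiply plus one add per instruction


def instruction_rate(clock_hz=CLOCK_HZ, states=STATES_PER_CYCLE):
    """Instructions per second when one instruction completes every cycle."""
    return clock_hz / states


def flop_rate(clock_hz=CLOCK_HZ, states=STATES_PER_CYCLE,
              flops=FLOPS_PER_INSTRUCTION):
    """Peak floating-point operations per second."""
    return instruction_rate(clock_hz, states) * flops


def cycle_time_ns(clock_hz=CLOCK_HZ, states=STATES_PER_CYCLE):
    """Duration of one instruction cycle in nanoseconds."""
    return states / clock_hz * 1e9


if __name__ == "__main__":
    print(f"{instruction_rate() / 1e6} MIPS")
    print(f"{flop_rate() / 1e6} Mflops")
    print(f"{cycle_time_ns():.0f} ns per instruction cycle")
```

These are peak rates: they hold only while the pipeline stays full and every instruction is a multiply/accumulate.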
The following features allow the DSP32C to be used in nontraditional DSP applications such as computer graphics and industrial control: 25-Mflop operation; 16-Mbit/s serial-input and serial-output ports; a 16-bit parallel I/O port for control and data transfer; interrupt facilities; single-instruction μ-law and A-law data conversions; single-instruction conversions between integers and floating-point data; a byte-addressable, on-chip memory that is extendable off chip; direct memory access to and from internal and external memory via parallel and serial I/O ports; 16 Mbytes of address space; and IEEE Std. 754 floating-point format conversion [3].

While remaining upwardly compatible with the DSP32, the DSP32C offers the enhancements listed in Table 1. The DSP32C contains 405,000 transistors in an area of 88 square millimeters, enclosed in a standard, 133-pin, square PGA (pin grid array) package.

Michael L. Fuccio, Renato N. Gadenz, Craig J. Garen, Joan M. Huser, Benjamin Ng, and Steven P. Pekarich, AT&T Bell Laboratories
Kreg D. Ulery, AT&T Microelectronics

IEEE Micro, December 1988. © 1988 IEEE

Figure 1. Microphotograph of the DSP32C.

Here, we describe the DSP32C's instruction set, architecture, and application development tools. The tools include an assembler, a simulator, an optimizing C compiler, and special-purpose hardware.

Table 1. DSP32C enhancements.

An overview of the architecture

Figure 2 shows the block diagram of the DSP32C. Two execution units, the control arithmetic unit (CAU) and the data arithmetic unit (DAU), achieve the high throughput characteristic of the device. Each unit has its own instruction set. The CAU performs 16- or 24-bit integer arithmetic for logic and control functions, while the DAU performs 32-bit floating-point and data conversion operations.
The CAU can function autonomously, performing data transfers, branching control, and integer arithmetic and logic operations in parallel with the DAU floating-point operations. In addition, the CAU generates addresses for the DAU operands. Besides the program counter, the CAU contains 22 general-purpose registers, which can be used in CAU instructions. In DAU instructions, registers R1 through R14 also function as memory pointers, and registers R15 through R19 as pointer increments. With the exception of a register load from memory, execution of a CAU instruction completes before the next instruction begins execution. This feature simplifies the use of the CAU for logic and control operations.

The DAU, on the other hand, employs a four-stage pipeline to perform 25 million floating-point computations per second. Configured for multiply/accumulate operations, the DAU is the primary execution unit for signal processing algorithms. It contains a floating-point multiplier and a floating-point adder that work in parallel to perform computations of the form a = b + c * d. The DAU has four 40-bit accumulators. It employs a straightforward fetch-multiply-accumulate-store pipeline, which we explain in detail later. Briefly, the DAU executes a multiply/accumulate instruction in four stages: fetch of c and d, multiplication of c and d, accumulation of the product of c and d with b (with the result stored in an accumulator), and an optional write of the result to memory or an I/O port. A maximum of two of the three multiply/accumulate operands can come from memory. The other operand comes from an accumulator register.

Figure 2. Block diagram of the DSP32C. Abbreviations used in the figure:
A0-A3: Accumulators 0-3
ALU: Arithmetic logic unit
CAU: Control arithmetic unit
DAU: Data arithmetic unit
DAUC: DAU control register
EMR: Error mask register
ESR: Error source register
Ibuf: Input buffer
IOC: Input/output control register
IR: Instruction register
IR1-IR4: Instruction register pipeline
ISR: Input shift register
IVTP: Interrupt vector table pointer
Obuf: Output buffer
OSR: Output shift register
PAR: PIO address register
PARE: PIO address register extended
PC: Program counter
PCR: PIO control register
PCW: Processor control word
PDR: PIO data register
PDR2: PIO data register 2
Pin: Serial DMA input pointer
PIO: Parallel I/O unit
PIOP: Parallel I/O port register
PIR: PIO interrupt register
Pout: Serial DMA output pointer
R1-R19: Registers 1 through 19
RAM: Read/write memory
ROM: Read-only memory
SIO: Serial I/O unit
Figure 3. DSP32C instruction set. The figure groups the instructions as follows.

Floating-point arithmetic instructions include:
    [Z =] aN = [-]aM {+,-} Y * X
    aN = [-]aM {+,-} (Z = Y) * X
    [Z =] aN = [-]Y {+,-} aM * X
    [Z =] aN = [-]Y * X
    aN = [-](Z = Y) * X
    [Z =] aN = [-]Y {+,-} X
    aN = [-](Z = Y) {+,-} X

Special functions, each of the form [Z =] aN = function(Y):
    ic(Y): μ-law, A-law, or 8-bit linear to float
    oc(Y): float to μ-law, A-law, or 8-bit linear
    float(Y): 16-bit integer to float
    float24(Y): 24-bit integer to float
    int(Y): float to 16-bit integer
    int24(Y): float to 24-bit integer
    round(Y): round to 32-bit float
    ifalt(Y): if a < 0, then move Y to aN
    ifaeq(Y): if a = 0, then move Y to aN
    ifagt(Y): if a > 0, then move Y to aN
    dsp(Y): IEEE float to DSP32 format
    ieee(Y): DSP32 float to IEEE format
    seed(Y): compute approximate reciprocal of Y

Control instructions:
    nop: no operation
    goto {N, rH, rH+N}: branch
    if (COND) goto {N, rH, rH+N}: conditional branch
    if (rM-- >= 0) goto {N, rH, rH+N}: conditional branch with loop counter
    call {N, rH, rH+N} (rM): call subroutine
    return (rM): return from subroutine
    ireturn: return from interrupt
    do M, {K, rH}: do M + 1 instructions K (or rH) + 1 times

Integer arithmetic/logic instructions: assignment/negate; increment/decrement; add registers (triadic); add register to constant; subtract registers (triadic); subtract register from constant; logical AND registers (triadic); logical AND constant with register; logical OR registers (triadic); logical OR constant with register; logical XOR registers (triadic); logical XOR constant with register; reverse carry add registers; logical AND with register complemented; arithmetic right shift; logical right shift; rotate right through carry; arithmetic left shift; rotate left through carry; compare; and bit test.

Data move instructions:
    rD = N
    {Z, *M, obuf, pdr, pir, pcw} = rS
    rD = {Y, *N, ibuf, pdr, pir, pcw}
    Z = {ibuf, pdr, pir, pcw}
    {obuf, pdr, pir, pcw} = Y

Notes to the figure: Data can be written using bit-reversed addressing. Both 16- and 24-bit integer operations may be specified. Instructions that do not use the constant N may be conditionally executed, as in the following example: if (eq) r1 = r2 + r3.

Data travels throughout the device via a 32-bit data bus, as seen in Figure 2. This bus supports four memory accesses during each instruction cycle: an instruction fetch, two operand reads, and a write to memory. This high-speed data bus and the pipelined architecture of the device allow the DSP32C to fetch two 32-bit operands from memory, perform multiply and accumulate operations, and write a result to an I/O port or memory during each instruction cycle.

The DSP32C provides on-chip memory and an external memory interface for off-chip memory expansion. Optional memory configurations allow users to download programs and data into three 512-word RAM banks on the chip, or to substitute an 8-Kbyte ROM for the third RAM bank. A 24-bit addressing capability increases the external memory capacity to 16 Mbytes. Programs can treat memory as a common resource, with instructions and data arbitrarily residing in on-chip RAM, on-chip ROM, or external memory. The external memory interface supports wait states and bus arbitration.

The parallel I/O, or PIO, port provides a parallel interface for communication between the DSP32C and external devices. It can be configured as an 8-bit (DSP32-compatible) or as a 16-bit port.
The serial I/O, or SIO, port provides serial communication and synchronization with devices outside the DSP32C. Three on-chip DMA controllers support direct, independent memory access via the serial input, serial output, and parallel I/O ports.

A single-level interrupt facility can respond to four internal and two external, individually maskable sources. A relocatable vector table controls program flow based on the source of the interrupt.

Instruction set

Figure 3 summarizes the instruction set of the DSP32C. The enhancements with respect to the DSP32 appear in boldface type. The assembly-language syntax used in the DSP32C instruction set is similar to the C programming language. This similarity is advantageous when moving algorithms that were developed in a higher level language such as C or Fortran onto the DSP32C.

In Figure 3 (and all displays of code in the text), we use the following notations: [ ] indicates an optional function, { } indicates a choice of one, and ( ) and uppercase letters indicate that the appropriate value should be inserted. That is, aN and aM are one of the four accumulator registers, noted as a0 through a3; rD and rS are from the set of CAU registers r1 to r22. X and Y may be an accumulator register, the serial input buffer (ibuf), or a memory location referred to by the register indirect addressing mode. Z may be the serial output buffer (obuf), the parallel data register (pdr), or a memory location. N can be represented as an unsigned or twos-complement number (limited to 16 bits, except for data-move and control instructions, in which it may be 24 bits). An asterisk before a variable or register indicates the memory location pointed to by that variable or register. An asterisk operator also indicates multiplication.

The DSP32C instruction set can be partitioned into two parts according to the two primary execution units, the DAU and the CAU.
The DAU executes instructions for floating-point arithmetic and special functions. These DAU data-stationary instructions [4] execute in a highly pipelined fashion. The CAU executes the remaining instructions, which fall into the following categories: control, integer arithmetic/logic, and data move.

Highlights. Since it is difficult to describe the entire instruction set in a few pages, we concentrate on some of its highlights.

Multiply/accumulate instructions. The DAU executes many variations of the multiply/accumulate instruction, as can be seen in Figure 3. Here, we limit our discussion to the following format:

    [Z =] aN = [-]aM {+,-} Y * X

An example of this instruction is:

    *r3++r17 = a0 = a1 + *r1++ * *r2++

This instruction performs the following operations. The contents of the memory locations pointed to by registers R1 and R2 are fetched and multiplied together. The contents of pointer registers R1 and R2 are then incremented. The product formed by this multiplication is added to the contents of accumulator A1. The accumulated result is then stored in accumulator A0 and is also written to the memory location pointed to by register R3. The contents of pointer register R3 are incremented by the contents of increment register R17.

Although DAU instructions employ a fetch-multiply-accumulate-store pipeline, the pipeline timing is such that the DSP32C achieves a throughput of one instruction per instruction cycle. As mentioned previously, a variety of these DAU instructions can be used efficiently for many signal processing algorithms.

DAU special functions. The DAU special-function instructions convert data formats and conditionally load accumulators. These single-cycle instructions convert companded and integer data types to and from floating-point data. In addition, single instructions convert a single-precision, IEEE Std. 754 floating-point number to and from the DSP32C internal floating-point representation. Such conversion capability is useful when inserting the DSP32C into computing environments that use the IEEE standard for floating-point arithmetic.

With one special-function instruction,

    aN = seed(Y)

users can approximate the reciprocal of a number Y as the seed value for the Newton-Raphson iteration used for a divide operation. This instruction eliminates approximately 20 percent of the cycles needed to compute a division. The division algorithm that incorporates this instruction provides a result that is precise to 22 bits in 1.5 microseconds.

CAU control instructions. The CAU executes control instructions that alter the program flow. These control instructions include branches, conditional branches, subroutine calls and returns, branches on counter value, and low-overhead looping. Conditional branching can be accomplished under DAU, CAU, or I/O conditions related to the status of the serial and parallel I/O buffers. Whether the branch is taken or not, the instruction following the branch instruction will always be executed, since it will already have been fetched.

The low-overhead looping construct

    do M, {K, rH}

can implement a loop with a specified length without incurring the overhead of a test-and-branch step. The only overhead is the one instruction cycle required to execute the do instruction itself. This loop instruction executes the next M + 1 instructions K (or rH) + 1 times, where rH is the contents of a CAU register. By permitting the loop counter operand to be placed into a register, the DSP32C do-loop construct gives the application program flexibility in computing the number of times a loop is to be executed. The DSP32C does not restrict the type of instructions that may be inserted into the loop, except that the do instruction cannot be nested.

Three-operand CAU instructions.
The CAU has a RISC-like (reduced instruction-set computing) instruction set. In fact, the CAU has a load/store architecture. Having loaded operands into its general-purpose registers, the CAU can perform arithmetic and logic operations common to most microprocessors, at an execution rate of 12.5 MIPS. The CAU supports three-operand (triadic) instructions for arithmetic and logic. An example is:

    rD = rS1 - rS2

In this instruction, the contents of two registers are subtracted, and the result is stored in a third (destination) register.

Arithmetic/logic instructions that do not have an immediate operand can be executed conditionally. An example is:

    if (eq) r3 = r1 + r2

Here, the contents of registers R1 and R2 are added together, and the result is stored in R3 only if the result of the previous CAU instruction is zero (eq). These conditional instructions save the overhead of a test-and-branch sequence common in microprocessors.

CAU move instructions. To load and store the CAU general-purpose registers, the CAU uses a set of data-move instructions (see Figure 3). For example, the instruction

    r3 = *r14++

indicates that the CAU register R3 is loaded with the contents of the memory location pointed to by the CAU register R14, and the contents of the latter are then incremented. These data-move instructions also handle the registers in the parallel and serial I/O units, so that direct transfers to and from the I/O units and memory can be performed. The following example shows a store of the contents of the serial input buffer (ibuf) into a memory location pointed to by CAU register R9, with postincrementation of the contents of the CAU register:

    *r9++ = ibuf

Internal architecture

We now describe the DSP32C architecture in more depth.

Processor cycle. The DSP32C has a processor cycle time of 80 nanoseconds at a clock frequency of 50 MHz.
Each processor cycle is divided into four states, numbered 0 through 3; each state equals one period of the clock input, or 20 ns (50-MHz clock frequency). Each DAU multiply/accumulate instruction requires up to four memory accesses: a memory read of the instruction (I), a memory read of the X operand, a memory read of the Y operand, and a memory write of the Z operand. In one cycle the DSP32C can perform an instruction fetch, two operand fetches, and a write to memory or the I/O. For a given instruction, since the DAU is pipelined, these four accesses do not occur in the same processor cycle. (See Figure 4.) Figure 5 illustrates a "full pipe" of multiple, sequential DAU multiply/accumulate instructions. The subscripts refer to instruction numbers; for example, X0 is the X operand for instruction number 0. The DAU fetches and executes instructions in ascending order.

Figure 4. Internal data bus (pipeline) for a single DAU instruction.

Figure 5. Internal data bus (full pipeline) for multiple DAU instructions.

Data arithmetic unit. As stated earlier, the DAU is the primary execution unit for signal processing algorithms. This unit contains a 32-bit floating-point multiplier, a 40-bit floating-point adder, four 40-bit accumulators, and a DAU control register, the DAUC. The multiplier and adder work simultaneously to process 12.5 million computations per second, each of the form a = b + c * d.

Figure 6 contains a block diagram of the DAU. The DAU transmits data to and receives data from other sections of the chip via the internal data bus. The DAU is divided into a data path and a control unit. The data path consists of a floating-point multiplier, a floating-point adder, registers, buses, and bus connectors. The DAU multiplier and adder operate in parallel, each requiring one processor cycle for execution. The four accumulator registers may be read or written via program control. The DAU control unit decodes an instruction into signals that control the data path section. The DAU data path and control unit each contain a four-stage (fetch-multiply-accumulate-store) pipeline. Thus, in one processor cycle, the DAU may be processing four different instructions, each in a different stage of execution.

Figure 6. Block diagram of the DAU. Abbreviations used in the figure:
A0-A3: Accumulators 0-3
DAUC: DAU control register
DU1, DU2: Adder input operand delay registers
IR: Instruction register
IR1-IR4: DAU instruction register (processor cycles 1-4)
P: Product register
S: Special-function register
X, Y: Multiplier input registers

The DAU contains both 32-bit and 40-bit registers and buses to support two floating-point formats. In addition to the standard number of bits in the mantissa, the DAU maintains 8 mantissa guard bits for accumulate operations. Figure 7 shows the two DAU floating-point formats. When a 40-bit bus connects to a 32-bit register or multiplexer, the guard bits are excluded (truncated). For example, when writing a 40-bit accumulated result to memory, the result is first truncated to 32 bits. The 40-bit accumulated result may also be rounded to 32 bits using a DAU instruction. When a 32-bit register drives a 40-bit bus, the guard bits (bits 15 through 8) are set to zero on the 40-bit bus.

Figure 7. DAU floating-point formats: 32 bits (a) and 40 bits (b). The fields shown are mantissa, guard (40-bit format only), and exponent.

Each DAU multiply/accumulate instruction involves three floating-point operands. Two of these operands are multiplied together, and the result is added to the third. One of the three operands comes from an accumulator. The result of the addition is stored in an accumulator and optionally written to memory or an I/O port. The value in this accumulator may be an intermediate result used in later multiply/accumulate operations, or it may be a final result to be stored in memory or sent to the I/O units.
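The guard-bit handling described in this section (truncation when a 40-bit value enters a 32-bit register, zero fill when a 32-bit value drives a 40-bit bus) can be sketched as plain bit manipulation. The text fixes the guard bits at bits 15 through 8 of the 40-bit word; the sketch additionally assumes that the 8-bit exponent occupies bits 7 through 0 and that the sign and mantissa occupy the remaining upper 24 bits in both formats. The function names are hypothetical and are not DSP32C instructions.

```python
# Minimal sketch, assuming exponent in bits 7..0 and sign/mantissa in the
# upper 24 bits of both formats; guard bits are bits 15..8 of the 40-bit word.
MANTISSA_FIELD_MASK = 0xFFFFFF   # 24-bit sign-and-mantissa field
BYTE_MASK = 0xFF                 # 8-bit exponent or guard field


def widen_32_to_40(word32):
    """Drive a 32-bit word onto a 40-bit bus with the guard bits set to zero."""
    upper = (word32 >> 8) & MANTISSA_FIELD_MASK
    exponent = word32 & BYTE_MASK
    return (upper << 16) | exponent


def truncate_40_to_32(word40):
    """Drop the guard bits of a 40-bit word (truncation, not rounding)."""
    upper = (word40 >> 16) & MANTISSA_FIELD_MASK
    exponent = word40 & BYTE_MASK
    return (upper << 8) | exponent


def guard_bits(word40):
    """Return the 8 guard bits (bits 15 through 8) of a 40-bit word."""
    return (word40 >> 8) & BYTE_MASK


if __name__ == "__main__":
    wide = widen_32_to_40(0x12345678)
    print(hex(wide), hex(guard_bits(wide)))
    print(hex(truncate_40_to_32(0x123456AB78)))
```

Widening followed by truncation returns the original 32-bit word, whereas truncating an accumulator value discards whatever precision the guard bits held.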
The DAU multiplier inputs can originate from memory, the SIO port, or an accumulator. Multiplier inputs are 32-bit floating-point numbers with a 24-bit mantissa and an 8-bit exponent. If the contents of an accumulator are input to the multiplier, the guard bits are first truncated. The multiplier always provides one input to the adder. The other adder input can originate from memory, the SIO port, or an accumulator. This input is either a 32- or 40-bit operand, depending on its source. The four accumulators, A0 through A3, provide the 40-bit operands.

Although the multiply/accumulate structure is rigid, the DSP32C has flexibility in the way operands are loaded into the three inputs. This flexibility ensures that the DAU will be suitable for a wide variety of digital signal processing algorithms.

The A0 through A3 accumulators eliminate round-off problems to ensure 24-bit precision. Postnormalization logic transparently shifts binary points and adjusts exponents to prevent inaccurate rounding of bits when the floating-point numbers are added or multiplied, thus eliminating concerns like scaling and quantization error. Each adder result, which is then stored in one of the accumulators, is fully normalized. All normalization occurs automatically.

Single-instruction data conversions take place in the DAU to free application programs from the overhead required for these conversions. The DAU converts data between the DSP32C internal floating-point format and IEEE Std. 754 32-bit floating-point, 16-bit integer, 24-bit integer, 8-bit μ-law and A-law [6], and 8-bit linear formats. The DAU also provides an instruction to convert a 32-bit floating-point operand to a 32-bit seed value used for reciprocal approximation in division operations.

Data instruction flow. The DAU employs a straightforward fetch-multiply-accumulate-store pipeline. Because all floating-point postnormalization is automatic, it does not require additional pipeline stages.
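The overlap of consecutive instructions in a fetch-multiply-accumulate-store pipeline can be sketched as a small schedule generator. This is a simplified model that assumes each stage takes exactly one processor cycle and that one instruction enters the pipeline per cycle; it ignores the state-level timing within a cycle. The function names are illustrative.

```python
# Simplified sketch of a four-stage pipeline schedule (one stage per cycle).
STAGES = ("XY fetch", "multiply", "accumulate", "write")


def schedule(n_instructions):
    """Return {stage: {cycle: instruction}} using 1-based cycles and instructions."""
    table = {stage: {} for stage in STAGES}
    for instr in range(1, n_instructions + 1):
        for offset, stage in enumerate(STAGES):
            table[stage][instr + offset] = instr
    return table


def total_cycles(n_instructions):
    """Cycles needed for the last instruction to leave the pipeline."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1


def render(n_instructions):
    """Format the schedule as text, last stage on the top row."""
    table = schedule(n_instructions)
    cycles = range(1, total_cycles(n_instructions) + 1)
    lines = ["cycle".ljust(12) + " ".join(f"{c:>2}" for c in cycles)]
    for stage in reversed(STAGES):
        cells = (f"{table[stage].get(c, ''):>2}" for c in cycles)
        lines.append(stage.ljust(12) + " ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(5))
    print(total_cycles(5), "cycles for 5 instructions")
```

Under these assumptions, n back-to-back instructions occupy n + 3 cycles, so the three-cycle fill cost becomes negligible for long multiply/accumulate sequences.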
The DSP32C takes advantage of this pipeline when multiply/accumulate instructions execute one after the other. Before considering consecutive multiply/accumulate operations, let's consider the flow of one instruction as it passes through the DAU pipeline. The DAU supports four multiply/accumulate instruction formats, thus providing flexibility in choosing operands and storing results. To simplify this discussion, we describe just one format. The DSP32C executes the multiply/accumulate instruction

    [Z =] aN = aM + Y * X

in four stages as follows: X and Y fetch, multiply X * Y, accumulate the product with accumulator AM and store the result in accumulator AN, and optionally store the result in the location specified by Z.

If several multiply/accumulate instructions execute one after the other, the DSP32C automatically pipelines the instructions so that one instruction completes in every processor cycle. We show this process in the following block of DSP32C instructions:

    1) [Z1 =] aN = aM + Y1 * X1
    2) [Z2 =] aN = aM + Y2 * X2
    3) [Z3 =] aN = aM + Y3 * X3
    4) [Z4 =] aN = aM + Y4 * X4
    5) [Z5 =] aN = aM + Y5 * X5

Again, we subscript X1-5, Y1-5, and Z1-5 to make it easier to follow the data flow in the DAU. Figure 8 displays the instruction execution.

Figure 8. Execution of DAU instructions. Columns are processor cycles; entries are instruction numbers.

    Cycle         1  2  3  4  5  6  7  8
    write                  1  2  3  4  5
    accumulate          1  2  3  4  5
    multiply         1  2  3  4  5
    XY fetch      1  2  3  4  5

The programmer must only be aware of 1) when an accumulate operation is finished, 2) whether the result accumulated in AN is intended to be used as an input to the multiplier, or 3) whether the contents of the location specified by Z are intended to be used as an input to the multiplier or the adder. For example, the contents of AN accumulated in instruction 1 will be available as a multiplier input in instruction 4, but can be used as an adder input in instruction 2. The contents of the memory location or I/O port specified by Z in instruction 1 will be available as an input to the multiplier and/or the adder in instruction 5.

The fact that the contents of an accumulator, multiplier input, or adder input can be written to memory or to an I/O port without requiring an additional instruction is a key feature when performing such tasks as windowing, adaptive filtering, and matrix operations.

While the four-stage pipeline in the DAU contributes to the high throughput of the DSP32C, it makes logic and control arithmetic problematic in the DAU. The CAU performs these functions. CAU instructions operate on 16- and 24-bit integers and resemble common RISC-style microprocessor instructions. The CAU is less pipelined than the DAU, and its integer arithmetic requires one instruction cycle. The CAU can perform its own operations while the DAU is in various stages of its pipeline. This arrangement allows the DAU and CAU instructions to work together.

Internal timing. The internal pipeline timing of the DAU for a single instruction of the form Z = aN = aM + Y * X is shown in Figure 9. The instruction requires five processor cycles (I0 through I4) to be decoded and executed in the DAU. Instruction I appears on the data bus in state 3 preceding I0. The X and Y registers are loaded in states 1 and 2 of I1, respectively. The contents of these registers are multiplied during I2, and the product is loaded into the product register P in state 3 of I2. The adder input register S is also loaded in state 3 of I2. The contents of the P and S registers are added during I3, and the result is loaded into an accumulator in state 3 of I3.
This same result is then placed on the data bus (which is routed to a destination in memory, the PIO port, or the SIO port) in state 0 of I4 if a Z-field is specified in the instruction. The Z-field is optional.

Figure 9. DAU internal pipeline timing for a single instruction. For instructions in the form of Z = aN = aM + Y * X, where X, Y = memory, I/O registers, or a0-a3; aM = a0-a3; aN = a0-a3; and Z = memory or I/O registers.

Control arithmetic unit. The CAU seen in Figure 10 performs address calculations, branching control, and 16- or 24-bit integer arithmetic and logic operations. It consists of a 24-bit ALU that performs the integer arithmetic and logical operations, a 24-bit program counter register, and twenty-two 24-bit general-purpose registers.

The CAU has two modes of operation: one executes CAU instructions, and the other generates addresses for the operands of DAU instructions. CAU instructions perform data movement, branching control, and 16- and 24-bit integer or logical operations. Since DAU instructions can have up to four memory accesses per instruction, the CAU generates addresses for these locations. It uses the postmodified, register indirect addressing mode, and it generates one address in each of the four states of an instruction cycle.

Figure 10. Block diagram of the CAU. The CAU data path connects to the pointer bus, increment bus, address bus, and update bus. Abbreviations used in the figure:
IR: Instruction register
PC: Program counter
R1-R14: CAU general-purpose registers, DAU pointer registers
R15-R19: CAU general-purpose registers, DAU increment registers
Pin (R20): CAU general-purpose register, serial DMA input register
Pout (R21): CAU general-purpose register, serial DMA output register
IVTP (R22): CAU general-purpose register, interrupt vector table pointer
M: Do-loop counter register (number of instructions in loop)
N: Do-loop counter register, immediate value (number of iterations of loop)
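The postmodified, register indirect addressing used for DAU operands can be sketched by modeling the effect of the earlier example instruction, *r3++r17 = a0 = a1 + *r1++ * *r2++. The sketch makes several assumptions that the text does not state: memory is modeled as a Python dictionary keyed by byte address, a plain ++ advances the pointer by four bytes (one 32-bit operand), pointers wrap at 24 bits, and Python floats stand in for the DSP32C floating-point format. Pipeline timing is ignored, and all names are hypothetical.

```python
# Minimal sketch of postmodified register-indirect addressing; all names and
# the 4-byte default increment are assumptions made for illustration.
WORD_BYTES = 4             # assumed step for a plain "++" on a 32-bit operand
ADDRESS_MASK = 0xFFFFFF    # 24-bit addresses


def post_modify(regs, pointer, increment):
    """Return the address held in `pointer`, then add `increment` to it."""
    address = regs[pointer]
    regs[pointer] = (address + increment) & ADDRESS_MASK
    return address


def mac_step(memory, regs, acc):
    """Model the net effect of *r3++r17 = a0 = a1 + *r1++ * *r2++."""
    y = memory[post_modify(regs, "r1", WORD_BYTES)]
    x = memory[post_modify(regs, "r2", WORD_BYTES)]
    acc["a0"] = acc["a1"] + y * x
    memory[post_modify(regs, "r3", regs["r17"])] = acc["a0"]


if __name__ == "__main__":
    memory = {0: 2.0, 100: 3.0}
    regs = {"r1": 0, "r2": 100, "r3": 200, "r17": 8}
    acc = {"a0": 0.0, "a1": 1.0}
    mac_step(memory, regs, acc)
    print(acc["a0"], memory[200], regs)
```

On the device the three addresses are produced in different states of the instruction cycle and the write is delayed by the pipeline; the sketch captures only the values that result.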
Registers R1 through R14 function as general-purpose registers for CAU instructions and as memory pointers (RP) for DAU instructions. When used for memory pointers, registers R1 through R14 hold 24-bit addresses. Registers R15 through R19 act as general-purpose registers for CAU instructions and as increment registers (RI) for DAU instructions. When used as increment registers, registers R15 through R19 hold 24-bit values that can postmodify addresses in the memory pointers. Register R20, called Pin (pointer in), acts as the SIO DMA input pointer, and R21, called Pout (pointer out), as the SIO DMA output pointer. Register R22, also called IVTP (interrupt vector table pointer), holds the base address of the interrupt vector table.

The CAU data-move instructions specify data transfers between CAU registers and memory, CAU registers and I/O registers, and I/O registers and memory, based on 8-, 16-, 24-, and 32-bit operands. All CAU arithmetic operations are performed on 24-bit, twos-complement, integer operands. If an instruction affects flags, the flags are calculated based on flag rules for 16-bit or 24-bit operations, depending on the size of the operation as specified in the instruction.

For example, when performing operations using 16-bit integer arithmetic, 1) 16-bit data is loaded into the lower 16 bits of the 24-bit CAU register(s), with the most significant bit of the integer extended into the upper 8 bits; 2) 24-bit arithmetic is performed, with flags computed according to 16-bit operation flag rules; and 3) the results are stored in memory by writing the lower 16 bits of the register onto a 16-bit memory location.

Many of the CAU arithmetic instructions can be performed using three different operands: two source registers (RS1, RS2) and one destination register (RD). CAU instructions can also be executed conditionally, based on flags generated in the CAU. Figure 11 illustrates the CAU operations for executing two instructions of the form:

    rD - rS
    if (eq) rD = rS1 + rS2

The first instruction subtracts the contents of the RS register from the contents of the RD register to set or clear the CAU z (zero) flag. The second instruction checks the z flag, and if it is set (that is, the result of the previous instruction was zero), adds the contents of RS1 and RS2 and stores the result in RD.

Figure 11. CAU internal pipeline timing for a CAU instruction. For instructions in the form of I0: rD - rS and I1: if (eq) rD = rS1 + rS2, where rD = destination register; rS, rS1, rS2 = source registers; eq = CAU condition equal to zero, based on the CAU z (zero) flag; PC = program counter; + = addition operation; - = subtraction operation; and ? = modification of registers and flags is conditional.

As noted earlier, with the exception of a register load from memory, a CAU instruction completes before the next instruction begins execution. This process simplifies use of the CAU for logic and control operations.

While DAU instructions execute, the CAU generates up to four addresses, one address in each of the four states of an instruction cycle. In each state, the CAU can add the contents of two registers: a pointer selected from registers R1 through R14 (placed on the pointer bus), and an increment selected from R15 through R19 (placed on the increment bus). The CAU employs a three-stage pipeline to 1) fetch from a register(s), 2) operate on the fetched operand(s), and 3) store the result in a register. Because of this pipelining, the preceding pointer is updated (result placed on the update bus) while the next pointer and increment are being accessed.

Figure 12 shows the CAU operations for executing a DAU instruction of the form:

    Z = aN = aM + Y * X

Note that the result of the multiply/accumulate operation is not available on the data bus until state 0 of instruction cycle I5 (see the discussion of the DAU pipeline), and the destination address is calculated during state 1 of instruction cycle I2. Thus, the address is latched and held for three instruction cycles before being placed on the address bus. The Z-address delay block shown in Figure 10 performs this three-instruction-cycle delay.

Memory. The DSP32C provides on-chip memory (RAM only or a RAM/ROM combination) and an external memory interface for off-chip memory expansion. Instructions and data can arbitrarily reside in on-chip RAM, on-chip ROM, or external memory. The addresses of the various blocks of memory can be configured in eight different memory modes.

Internally, the DSP32C contains 4,096 bytes of RAM that are available in both of the memory configurations offered. Also, the chip provides either 8,192 bytes of mask-programmable ROM or an additional 2,048 bytes of RAM. Thus, two on-chip memory configurations are possible: 4,096 bytes of RAM and 8,192 bytes of ROM, or 6,144 bytes of RAM. The on-chip RAM is static and does not need to be refreshed. Users can mask-program the ROM with an application program(s) and/or fixed data. In addition, they can secure the contents of ROM, to protect proprietary code from examination, by wire-bonding the corresponding pad inside the package.

The external memory interface directly addresses up to 16 Mbytes of memory, with no loss of execution speed. This interface also supports wait states to accommodate access to slower speed memory and I/O peripherals, and bus arbitration to accommodate the need for sharing the external memory interface signals with other devices. The two-section external memory is divided into a low partition A and a high partition B. Users can independently configure the number of wait states for each partition. Therefore, a mix of slow and fast memory can be used to provide the necessary throughput at a reasonable cost.
If the flag is not set, the contents of RD are unaffected by this instruction. Except for the register load from memory that has a latency of one instruction cycle, execution of 40 IEEEMICRO zyxwvutsrqponm zyxwvutsrqponmlkjihgfedcbaZ zyxwvutsrqp zyxwvutsr For instructionsin the form of I, * r Pz + + r Iz =aN =aM + * r P e + r I y + ' r Px + + r Ix where r P z , r Py , r Px = Pointer register for DAU, 2 , Y, and X operands r Iz , r I y,r I = Increment register for DAU, Z. Y, and X operands Figure 12. CAU internal pipeline timing for a DAU instruction. PC = Program counter + = Addition operand A = Address zyxwvutsrqpo Each wait state takes one quarter of an instruction cycle (20 ns). The onchip address, data latches, and memory control signals allow a zero-chip interface to a standard byte-wide memory chip. All instructions are 32 bits wide, fetched in one memory access. Four data types (8-, 16-, 24- and 32-bit) exist, and the memory is uniformly byte addressable, with 32-bit data accessed at the same speed as 8-bit data. Data bus. The DSP32C employs a von Neumann architecture with one address bus and one data bus. Data travels throughout the DSP32C via its 32-bit internal data bus. This data bus supports four data transfers during each processor cycle: the instruction fetch, two memory operand reads, and a memory write. The internal data bus accesses all sections of the chip as well as the external memory interface. The latter interfaces to a 32-bit external data bus. Parallel I/O unit. The P I 0 unit consists of a register file, control unit, and a bidirectional data bus. It allows the DSP32C to communicate with external devices. An external microprocessor easily controls the DSP32C via the parallel port. For example, the P I 0 unit can initiate DMA to or from the DSP32C, reset the device, halt it, or monitor various conditions internal to it. The external parallel I/O data bus PDB can be either 8 or 16 bits wide. 
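The timing figures quoted for the memory interface (a wait state equal to one quarter of an instruction cycle, or 20 ns) follow from the 50-MHz clock and the four states per instruction cycle. The short Python sketch below reproduces that arithmetic. The function names are illustrative only, and the access-time model (a zero-wait-state access occupying exactly one state) is an assumption made for this sketch, not a documented DSP32C specification.

```python
# Timing arithmetic for a 50-MHz DSP32C (illustrative sketch).
CLOCK_HZ = 50_000_000          # clock rate stated in the article
STATES_PER_INSTRUCTION = 4     # four states per instruction cycle


def state_time_ns(clock_hz=CLOCK_HZ):
    """Duration of one state (one clock period) in nanoseconds."""
    return 1e9 / clock_hz


def instruction_time_ns(clock_hz=CLOCK_HZ):
    """Duration of one four-state instruction cycle in nanoseconds."""
    return STATES_PER_INSTRUCTION * state_time_ns(clock_hz)


def mips(clock_hz=CLOCK_HZ):
    """Instruction rate in millions of instructions per second."""
    return clock_hz / STATES_PER_INSTRUCTION / 1e6


def access_time_ns(wait_states, clock_hz=CLOCK_HZ):
    """External access time with the given number of wait states.

    ASSUMPTION: a zero-wait-state access occupies one state; each
    wait state then adds one more state (20 ns at 50 MHz).
    """
    return (1 + wait_states) * state_time_ns(clock_hz)


if __name__ == "__main__":
    print(state_time_ns())        # 20.0
    print(instruction_time_ns())  # 80.0
    print(mips())                 # 12.5
    print(access_time_ns(2))      # 60.0
```

Under these assumptions, a partition configured with two wait states would need 60 ns per access, which is how a mix of slow and fast memory in partitions A and B could be budgeted.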
The PIO contains three 16-bit data registers (PDR, PDR2, and PIR), a 24-bit address register (PAR/PARE), a 16-bit processor control word (PCW), an 8-bit I/O port register (PIOP), a 10-bit control register (PCR), a 16-bit error mask register (EMR), and an 8-bit error source register (ESR). These registers control PIO transfers and configure error control and interrupt features.

Users can configure the PIO unit as an 8- or 16-bit parallel data bus interface. When configured as an 8-bit interface, the PIO supports two additional 4-bit PIO ports, which can be individually configured as inputs or outputs and read or written by the DSP32C. When a port is configured as an input, the DSP32C can read the 4 bits of data present at the pins and those bits can be used for data or control.

The PIO monitors six maskable, internal error conditions. It can be configured to interrupt an external device and, if desired, halt the DSP32C when an unmasked error condition occurs.

PIO data transfers can be made under program, interrupt, or DMA control. Using PIO DMA, an external device can download a program and upload the results of an operation without interrupting the DSP32C program in progress. PIO buffer-empty and PIO buffer-full conditions allow the DSP32C program or interrupt-driven I/O to read and/or write to the PIO port.

Serial I/O unit. The SIO unit provides serial communication and synchronization with external devices. External control pins on the DSP32C allow a direct interface to a time-division-multiplexed (TDM) line, a zero-chip interface to many codecs (coder/decoders), and direct DSP32C-to-DSP32C transfers for multiprocessor applications. The SIO performs serial-to-parallel conversion of input data and parallel-to-serial conversion of output data, at a maximum rate of 16 Mbits/s. This unit is composed of a serial input port, a serial output port, and an on-chip clock generator.
A serial input/output control (IOC) register controls the SIO input and output formats. The serial input and serial output ports are doubly buffered, making back-to-back transfers possible by allowing a second transfer to begin before the first has been completed. The transfer lengths of the serial input and output words can be configured as 8, 16, 24, or 32 bits, and are selected independently of each other. The input and output data can also be independently selected to process either the most significant bit or the least significant bit first.

SIO transfers can be made under program, DMA, or interrupt-driven control. In DMA mode, transfers occur between Ibuf and memory or between memory and Obuf without program intervention. The serial input buffer-full and serial output buffer-empty flags allow the DSP32C program or interrupt-driven I/O to read and/or write to the SIO port.

Interrupts

The DSP32C provides a single-level interrupt facility and responds to six interrupt sources, four internal and two external. The interrupts are prioritized and are individually maskable. A relocatable vector table controls program flow based on the source of the interrupt. The following list presents the interrupt sources in order of descending priority:

1) External interrupt one (INTREQ1);
2) Parallel I/O buffer full (PDF), generated when the PDR register is loaded;
3) Parallel I/O buffer empty (PDE), generated when the PDR register is read;
4) Serial input buffer full (IBF), generated when the serial input buffer is loaded;
5) Serial output buffer empty (OBE), generated when the serial output buffer is emptied; and
6) External interrupt two (INTREQ2).

In response to a given interrupt, the DSP32C branches to the corresponding address in the interrupt vector table (Figure 13). This table contains eight pairs of 32-bit words starting at the location specified in the IVTP register R22. Before interrupts are enabled, the IVTP register should contain the base address of the interrupt vector table.

Figure 13. Interrupt vector table. (Entries are 32 bits wide; addresses are byte offsets from the base address held in IVTP.)

  Address   Interrupt source
  +0        External interrupt 1
  +8        PIO buffer full
  +16       PIO buffer empty
  +24       SIO input buffer full
  +32       SIO output buffer empty
  +40       External interrupt 2
  +48       Reserved
  +56       Reserved

Before servicing an interrupt, the DSP32C automatically saves the state of the machine that is invisible to the programmer, the DAU accumulators A0 through A3 (including guard bits), and the DAUC register. The internal state that is visible to the programmer must be saved and restored by the interrupt service routine. To return to the interrupted program, the interrupt service routine must restore the user-visible state of the DSP32C (that was saved) and then execute the ireturn instruction. This step restores the state of the machine that is not visible to the user.

The low interrupt overhead results from the use of only two instruction cycles to enter an interrupt service routine and one instruction cycle to return to the interrupted program.

Development tools

Users can access both software and hardware tools to aid in the development of application programs for the DSP32C.

Software. Software tools used to create, test, and debug DSP32C application programs at the assembly-language level are packaged in the WEDSP32C-SL Support Software Library. The library includes an assembler, a link editor, a simulator, and other utilities, all of which run under the Unix or MS-DOS operating systems. (Work is in progress to port these tools to the Macintosh and VMS operating systems.)

The assembler translates the user's assembly-language program into the binary code used by the DSP32C. A notable feature of the DSP32C assembly language is its use of a high-level, C-like syntax. The accompanying box displays two programming examples. The assembler generates relocatable code that the link editor can easily alter.
The relocatable code can reside anywhere in the addressable space and can be combined with code that was assembled separately.

The simulator program emulates the operations of the DSP32C program in a non-real-time environment. For full program debugging, the simulator allows access to all registers and memories. It also provides an interface to the DSP32C hardware development system described later. The simulator's capabilities include single-stepping, breakpointing, and execution profiling. Simulator users can freeze processing on many conditions, such as the execution of a specified instruction, access of a specified register or memory location, or occurrence of a specified number of I/O events. A file can supply input data, while another file can capture output data. Users can refer to memory locations by their symbolic names rather than their absolute addresses. In addition, users can define complex command sequences and display formats, and invoke them with one command.

Another very powerful software tool is the optimizing C-language compiler. The DSP32C compiler allows users to write application programs in the general, high-level C language. Thus, in applications in which users develop preliminary programs in C, the source code can be moved to the DSP32C with a minimal amount of time and effort.

The optimizing portion of the C compiler performs generic optimizations aimed at taking advantage of the DSP32C instruction set and resources. In addition, the optimizer can analyze program data flow to satisfy data dependencies introduced by some pipeline latencies in the DSP32C. Pipeline optimization reorders instructions where possible or uses NOP (no operation) instructions to flush the pipeline. Some data dependencies cannot be resolved from a C program. Extra user control of the optimization process is provided at the C source level for such cases. Hartung et al. offer further information about the C compiler.

Application library.
The DSP32C product line includes a set of common routines written in the DSP32C assembly language. These routines compose the WEDSP32C-AL Applications Library. This library contains many commonly used signal processing functions such as finite impulse response (FIR) filters, infinite impulse response (IIR) filters, adaptive filters (LMS algorithm), and fast Fourier transforms (FFTs). In addition, the library provides commonly used functions such as sin(x), cos(x), ln(x), and log(x). As of this writing, the library contains over 60 subroutines, which also execute on the DSP32 [7,8]. The DSP32C C compiler can call these Applications Library subroutines.

C compiler. The DSP32C C compiler is a complete implementation of the C language. It supports all integer and single-precision floating-point data types. The compiler is an implementation of the Unix portable C compiler, which guarantees portability of C programs from Unix System V machines to the DSP32C. In addition to handling generic operations, like +, -, *, /, %, &, |, and so on, the DSP32C C compiler takes full advantage of the DSP32C multiply/accumulate instructions. For example, complex operations often found in signal processing, such as a = b + (c * d), are compiled into one instruction.

Hardware. The WEDSP32C-DS development system supports application system hardware development and real-time evaluation of DSP32C programs. The DSP32C-DS development system, a PC-based family of five boards, can be used to 1) develop DSP32C programs and test them in real time and 2) perform in-circuit emulation of user-target hardware. The modular design of the development system allows multiple configurations to achieve various development environments.
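The multiply/accumulate pattern that the C compiler maps onto a single instruction (a = b + (c * d)), and that FIR routines such as those in the Applications Library rely on, can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name fir_output is hypothetical, Python floats are double precision rather than the DSP32C's 32-bit format, and the sketch does not model the DAU pipeline.

```python
# One FIR output sample computed as a chain of multiply/accumulate steps.
# Each loop iteration has the form acc = acc + (c * x), the same shape as
# the a = b + (c * d) expression discussed in the text.


def fir_output(coeffs, samples):
    """Return (output sample, number of multiply/accumulate steps).

    coeffs and samples are equal-length sequences; samples[i] is the
    input value paired with coefficient coeffs[i].
    """
    if len(coeffs) != len(samples):
        raise ValueError("coeffs and samples must have the same length")
    acc = 0.0
    mac_count = 0
    for c, x in zip(coeffs, samples):
        acc = acc + c * x      # one multiply/accumulate per tap
        mac_count += 1
    return acc, mac_count


if __name__ == "__main__":
    y, macs = fir_output([0.5, 0.25, 0.25], [4.0, 8.0, 4.0])
    print(y, macs)             # 5.0 3
```

Because each tap needs exactly one such step, a filter whose inner loop is a single multiply/accumulate instruction costs one instruction cycle per tap.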
The DSP32C simulator, running on an AT&T PC6300 (or compatible), provides a user interface. The simulator allows the user to download programs to the device on the development system and to execute them there. The user can set breakpoints on instruction fetches, allowing the memory, registers, and accumulators to be examined and modified. The PC communicates with the DSP32C device directly through its parallel I/O port.

Programming examples

Example 1

    /* 3x3 x 3x1 matrix multiply for DSP32C */
    #define matA r1
    #define matB r2
    #define matC r3
    #define dec  r15

    mat3x1: matA = A                 /* address of A[0,0] */
            matB = B                 /* address of B[0,0] */
            matC = C                 /* address of C[0,0] */
            dec = -8                 /* dec is used to re-init matB ptr */
            do 2,2
            a0 = *matB++ * *matA++
            a0 = a0 + *matB++ * *matA++
            *matC++ = a0 = a0 + *matB++dec * *matA++
    end:    nop

    .rsect ".ram1"                   /* Assembler directive. */
    A:      float 1.0, 2.0, 3.0      /* Data for matrix A. */
            float 5.0, 6.0, 7.0
            float 9.0, 10.0, 11.0
    B:      float 1.0, 2.0, 3.0      /* Test data for matrix B. */
    C:      3*float 0.0              /* Stores result of C here */

Figure A. A simple, code-efficient DSP32C programming example: (3 x 3) x (3 x 1) matrix multiply.

The first four #define statements are simply preprocessor commands that instruct the preprocessor to replace a specific variable with a DSP32C register each time the variable is encountered in the program. These optional statements help make the resulting code more readable, since we chose the variable names to be meaningful (matA is the variable that points to elements in matrix A, and so on).

Each line is a single-word instruction. The four instructions beginning at the label mat3x1 initialize the pointer variables (registers) used in the matrix multiply. A value of -8 decrements the pointer to matrix B by two 32-bit locations (recall that the address space is byte addressable, so 8 bytes equal two 32-bit locations).

Once the pointers are initialized, only four instructions compute the matrix multiplication. The first instruction (do 2,2) is a low-overhead looping instruction. Its syntax is "DO the next N + 1 instructions M + 1 times," which, in this case, translates to executing the following block of three instructions three times. Branching to re-execute the block of three instructions incurs no overhead.

Each of the three instructions in the loop comes from the DAU multiply/accumulate group. Borrowing from the C programming language, we use an asterisk before a variable or register to indicate the memory location pointed to by that variable or register. The asterisk operator also indicates multiplication. If ++ follows a variable or register, it indicates postincrementation by one memory location (a 32-bit location for multiply/accumulate instructions). In the third multiply/accumulate instruction, the variable matB (in register R2) is followed by ++dec, which means postincrementation by the contents of the variable dec (previously defined to be in register R15). In this case, dec has been initialized to -8, which re-initializes the variable matB to point to the first element of matrix B.

Each pass through the loop multiplies a row of the square matrix A by the column matrix B and writes the summed result as a new entry to the column matrix C. Note that the write to matrix C is performed in the last multiply/accumulate instruction and avoids the necessity of a separate data move instruction.

Example 2

This second example shows the assembly code for an in-place, 4,096-point, complex, radix-2, decimation-in-time FFT routine.

    /* COMPLETE DSP32C assembly code listing for FFT */
    main:   call _fft (r14), nop
            int24 4096, 12, Real, Imag       /* Arguments for 4096 pt complex FFT */
    #define ur a2
    #define ui a3

    /* Generalized FFT subroutine for in-place complex radix 2 DIT FFT */
    _fft:   r10e = *r14++                    /* Get arguments - N */
            r7e = *r14++                     /* log2(N) */
            r2e = *r14++                     /* Pointer to Real */
            r17e = *r14--                    /* Pointer to Imag */
            r9e = 2
            r7e = r7 - 2
    bitrev: r17e = r17 - r2                  /* IN-PLACE BIT REVERSAL */
            r19e = -r17
            r16e = r10 * 2
            r18e = r10 - 3
            r1e = r2
    A:      r1e - r2
            if (ge) pcgoto B, r5e = r1
            *r1++r17 = a0 = *r2++r17
            *r1++r19 = a0 = *r2++r19
            *r2++r17 = a0 = *r5++r17
            *r2++r19 = a0 = *r5++r19
    B:      r1e = r1 + 4
            if (r18-- >= 0) pcgoto A, r2e = r2 # r16
            r8e = 1
    /* FFT CALCULATION */
    fftc:   r4e = W                          /* Initialization */
            r13e = one
            r6e = two
    C:      r10e = r10 / 2
            r9e = r9 * 2
            r15e = r9 * 2
            r15e = r15 - r17                 /* DFT Stage initialization */
            r2e = *r14
            ur = *r13++
            ui = *r13--
            r16e = r8 - 2
            r8e = r8 * 2
    D:      r1e = r2
            r11e = r1
            r3e = r1 + r9                    /* Twiddle calculation initialization */
            r12e = r3
            r18e = r10 - 1
    butfly: do 5, r18                        /* BUTTERFLY (6 instruction cycles) */
            a0 = *r1++r17 + ur * *r3++r17                /* T = R(i)+Ur*R(k) */
            *r11++r17 = a0 = a0 - *r3++r19 * ui          /* R(i)=T= T-Ui*I(k) */
            a1 = *r1++r19 + ui * *r3++r17                /* T = I(i)+Ui*R(k) */
            *r12++r17 = a0 = -a0 + *r1++r17 * *r6        /* R(k) = -T+2*R(i) */
            *r11++r15 = a1 = a1 + *r3++r15 * ur          /* I(i)=T= T+Ur*I(k) */
            *r12++r15 = a1 = -a1 + *r1++r15 * *r6        /* I(k) = -T+2*I(i) */
    twid:   a0 = ur * *r4++                  /* Compute next twiddle */
            ur = a0 - ui * *r4               /* u = u * w */
            a0 = ur * *r4--
            ui = a0 + ui * *r4
            if (r16-- >= 0) pcgoto D, r2e = r2 + 4
            if (r7-- >= 0) pcgoto C, r4e = r4 + 8
    E:      goto r14+4, nop

    /* Constant, Twiddle and Data tables */
    two:    float 2.0
    one:    float 1.0, 0.0
    W:      float -1.0000000, 0.0000000, 0.0000000, -1.0000000, 0.7071068, -0.7071068
            float 0.9238795, -0.3826834, 0.9807853, -0.1950903, 0.9951847, -0.0980171
            float 0.9987955, -4.9067674e-2, 0.9996988, -2.4541228e-2, 0.9999247, -1.2271538e-2
            float 0.9999812, -6.1358846e-3, 0.9999953, -0.0030676, 0.9999988, -0.00153398

    /* Data for f(n)=sin(n*pi/2)+sin(n*pi/4+pi/4) */
    .=0x004000
    Real:   512*float 0.707, 2., 0.707, -1., -0.707, 0., -0.707, -1.   /* Real data in true order */
    Imag:   4096*float 0.0                                             /* imaginary data */

The DSP32C Development Card lets the user develop software algorithms in a PC environment. This AT&T PC6300 (or compatible) plug-in card contains a DSP32C device and a 16K x 32-bit, high-speed static RAM that can be upgraded to a 64K x 32-bit static RAM. The DSP32C serial I/O port interfaces to an on-board codec that is socketed to accept either an AT&T T7520A high-precision codec with filters or a pin-compatible device. Two mini-coaxial connectors on the development card allow users to supply analog I/O to the codec. Users can bypass the on-board codec and interface directly to the DSP32C SIO port through a 34-pin connector on the card. The upper byte of the DSP32C PIO port is also available to users as I/O bits via a 16-bit connector on the card.

The DSP32C Development Card can be used independently or with a half-height daughter board, which increases the available memory by 1M x 32 bits. This DRAM Extended Memory Card interfaces the DSP32C on the development card to four banks of 256K x 32-bit dynamic RAM. When both cards are used together, they occupy one and a half PC slots.

The DSP32C In-Circuit Emulator Card emulates target system hardware in real time. This card contains a DSP32C device; a 133-pin, PGA-adapter socket, which plugs into the target hardware; and buffers to allow the user to access the DSP32C PIO port during breakpoint operation.
A 44-pin ribbon cable connects the card to the PC via a half-height PC plug-in card called the DSP32C PC Bus Card. The PC Bus Card provides the parallel communication between the PC and the DSP32C on the in-circuit emulator. Used in conjunction with the PC Bus Card, the emulator card allows the user to run breakpointed code entirely from the target system hardware.

The PC Bus Card can also be used with the DSP32C Multi-In-Circuit Emulator Card, which allows the user to multiplex up to four emulator cards to a single PC slot. The stand-alone DSP32C multiemulator interface also interfaces to the PC over a 44-pin ribbon cable connected to the PC Bus Card. When using this interface, the user can select any single emulator card for communication with the host PC. The software operates in a polled mode to control this operation.

Performance benchmarks, applications

Table 2 lists a set of signal processing benchmarks for the DSP32C. The FFT benchmark does not include bit-reversed ordering of the output. Note that an in-place, 1,024-point, complex FFT with ordering of outputs takes 3.2 ms.

Table 2. DSP32C signal processing benchmarks.

  Benchmark                                 Time
  FIR filter                                80 ns/tap
  LMS adaptive filter                       160 ns/tap
  5 multiply second-order section (IIR)     400 ns
  Lattice filter                            160 ns/section
  Divide                                    1.5 µs
  (3 x 3)(3 x 1) matrix multiply            720 ns
  256-point window                          20 µs
  1,024-point, complex FFT                  2.9 ms

The availability of the C compiler means that high-level programming language benchmarks are meaningful for the DSP32C. We have measured the single-precision Whetstone benchmark and the Dhrystone benchmark and found that the DSP32C compares very favorably with state-of-the-art microprocessors. The following benchmark results were obtained from a C-code implementation of the benchmarks compiled with the DSP32C C compiler: 6.46 million Whetstones per second, and 14,336 Dhrystones per second.
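The Table 2 times are easier to interpret when expressed in instruction cycles. The Python sketch below divides each published time by the 80-ns instruction cycle implied by the 12.5-MIPS rate at 50 MHz. The dictionary keys and the function name are illustrative; the times are the published table values, which are rounded, so a result need not be a whole number of cycles.

```python
# Convert the Table 2 benchmark times into 80-ns instruction cycles.
INSTRUCTION_NS = 80.0   # one instruction cycle at 12.5 MIPS

# Published benchmark times, converted to nanoseconds.
BENCHMARK_NS = {
    "FIR filter (per tap)": 80.0,
    "LMS adaptive filter (per tap)": 160.0,
    "5 multiply second-order section (IIR)": 400.0,
    "Lattice filter (per section)": 160.0,
    "Divide": 1_500.0,
    "(3 x 3)(3 x 1) matrix multiply": 720.0,
    "256-point window": 20_000.0,
    "1,024-point complex FFT": 2_900_000.0,
}


def instruction_cycles(time_ns):
    """Number of 80-ns instruction cycles in the given time."""
    return time_ns / INSTRUCTION_NS


if __name__ == "__main__":
    for name, time_ns in BENCHMARK_NS.items():
        print(f"{name}: {instruction_cycles(time_ns):g} cycles")
```

Read this way, the FIR filter costs one instruction cycle per tap, and the 720-ns matrix multiply corresponds to nine cycles, which is consistent with a three-instruction multiply/accumulate loop executed three times if pointer initialization is not counted (an assumption, since the article does not state what the benchmark includes).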
As can be seen from the features we've described, the DSP32C is suitable for use in many different application areas such as telecommunications, speech processing, image processing, graphics, array processors, robotics, studio electronics, instrumentation, and military applications. Table 3 lists some of the applications.

Table 3. DSP32C applications.

  Application                   Example

  Electronic data processing
    Mass memory                 Disk controllers, high-precision servo control
    Workstations                Graphics, translations, rotations, shading, perspective scaling, inversion, multiplication, numeric accelerators, array processing
    Front-end processor         Bit-manipulation, encryption

  Industrial
    Robotics                    High-precision servo control
    Image processing            Restoration, pattern recognition, compression
    Process control             Minicomputer functions
    Real-time simulators        Graphics, servo control, system modeling
    Instrumentation             Oscilloscopes, FFT, spectrum analysis, signal generators

  Telecommunications
    PBX                         Tone detection, tone generation, MF, DTMF
    Switches                    Tone detection, tone generation, line testing
    Modem                       Echo cancellation, filtering, error correction and detection
    Transmission                Multipulse LPC, ADPCM, transmultiplexing, encryption

  Government/military
    Sonar                       Beam forming, FFT
    ECM                         FFT, adaptive filtering
    Airframe                    Simulation
    Radar tracking              Precision FFT, matrix inversions

  Speech
    Recognition                 Feature extraction, spectrum analysis, pattern matching
    Synthesis                   LPC, formant synthesis
    Coding                      ADPCM, LPC, multipulse LPC, vector quantization

  Consumer
    Studio electronics, entertainment, educational
                                Digital audio, high-end video (special effects)

The DSP32C, an upwardly compatible extension to the DSP32 architecture, excels in many signal processing applications [10-13] as well as in nontraditional DSP chip applications. The enhancements in the DSP32C widen the spectrum of its potential applications. We have described the architecture and instruction set of the DSP32C and its support products, including the C compiler, which allows an easy implementation of many applications.

Acknowledgments

We gratefully acknowledge the contributions of C.N. Tanga, N. Agrawal, and J.E. Beck for their work on the design, support, and verification of the DSP32C; and our design partners in R.N. Kershaw's group in the Digital IC Design Department for their work on the circuit design and layout of the device. We also thank M. Murphy and A.A. Pignone for their design of the development system; J. DeHart for developing several of the DSP32C software tools; and the DSP Systems Development Group under J. Hartung for developing the C compiler, the Applications Library, and several applications. J.R. Boddie and R.A. Pedersen managed the DSP32C project.

References

1. J.R. Boddie et al., "A Floating Point DSP with Optimizing C Compiler," Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, Vol. 88CH2561-9, Apr. 7-11, 1988, pp. 2009-2012.
2. W.P. Hayes et al., "A 32-Bit VLSI Digital Signal Processor," IEEE J. Solid State Circuits, Vol. SC-20, No. 5, Oct. 1985, pp. 998-1004.
3. ANSI/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic, IEEE Comp. Soc., Los Alamitos, Calif., 1985.
4. P.M. Kogge, "The Microprogramming of Pipelined Processors," Proc. Fourth Ann. Conf. Computer Architecture, Mar. 1977, pp. 63-69.
5. W.H. Press et al., Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, 1986, p. 254.
6. A. Kundig, "Digital Filtering in PCM Telephone Systems," IEEE Trans. Audio and Electroacoustics, Vol. AU-18, No. 4, Dec. 1970, pp. 412-417.
7. J.R. Boddie et al., "The Architecture, Instruction Set and Development Support for the WEDSP32 Digital Signal Processor," Proc. IEEE ICASSP, Vol. 86CH2243-4, Apr. 7-11, 1986, pp. 421-424.
8. WEDSP32 Support Software Library, AT&T, Allentown, Pa., July 1986.
9. J. Hartung et al., "A Practical C Language Compiler/Optimizer for Real Time Implementations on a Family of Floating Point DSPs," Proc. IEEE ICASSP, Vol. 88CH2561-9, Apr. 7-11, 1988, pp. 1674-1677.
10. R.V. Cox et al., "Implementation and Application of an Embedded Subband Coder," Proc. Int'l Conf. Communications, Vol. 88CH2538-7, June 12-15, 1988, pp. 90-95.
11. J. Tow et al., "Implementation of DSP Applications Using the AT&T DSP32C C Compiler and Applications Library," Proc. Int'l Symp. Circuits and Systems, Vol. 88CH2458-8, June 7-9, 1988, pp. 1061-1065.
12. R.V. Cox et al., "New Directions in Subband Coding," IEEE J. Selected Areas in Commun., Special Issue on Voice Coding for Commun., Vol. 6, No. 2, Feb. 1988, pp. 391-409.
13. J. Tow, "Implementation of Digital Filters with the WEDSP32 Digital Signal Processor," Application Note, AT&T Microelectronics, Allentown, Pa., Apr. 1988.

Michael L. Fuccio was lead engineer for the DSP32C project in the AT&T DSP Design Group. He is presently on temporary assignment at the City College of New York, where he is teaching computer architecture courses. He has been involved in the architecture of 32-bit microprocessor chips and peripherals, and in the design and testing of the WEDSP16 DSP. His interests include the architecture of DSPs, adaptive signal processing, multiprocessing, and engineering education.
Fuccio received the BEEE degree from the State University of New York at Stony Brook and the MSEE degree from Georgia Institute of Technology. He is presently pursuing further graduate study at Rutgers University in New Jersey. He is a member of the IEEE Computer Society.

Benjamin Ng is a member of the technical staff in the DSP Design Group. He was a member of the architecture team for the WE32100 chip set and an architect for the WE32201 MMU/data cache. Then he worked on the architecture and design of a graphics processor. His areas of interest include computer architecture and computer graphics.
Ng received his BSEE from Cornell University, New York, and his MSEE from Columbia University, also in New York. He is a member of the IEEE Computer Society.

Renato N. Gadenz is a distinguished member of the technical staff working in the DSP Design Group. He was a member of the team that developed the first Bell Labs programmable DSP and has been connected since then with the design, simulation, and/or testing of each member of the AT&T DSP family. Previously, he had designed active filters, developed software for testing microprocessors and microprocessor systems, and worked on mixed analog and digital ICs. Before joining Bell Labs, he was a professor and chair of the Departamento de Electricidad at the Universidad de Chile and a lecturer and associate research engineer at the University of California at Los Angeles.
Gadenz received his Ingeniero Civil Electricista degree from the Universidad de Chile in Santiago, his MSEE degree from the University of Pittsburgh, and his PhD degree from the University of California, Los Angeles. He is a member of the IEEE.

Craig J. Garen, as supervisor of the DSP Design Group at AT&T Bell Laboratories, most recently supervised the design of the DSP32C. He has contributed to the design of three general-purpose programmable DSPs, the DSP20, DSP16, and DSP32.
Garen received his BSEE from Lehigh University in Pennsylvania and his MSEE from California's Stanford University. He is a member of Tau Beta Pi and Eta Kappa Nu.

Steven P. Pekarich currently works as a member of the technical staff in the DSP Design Group. He was responsible for the architecture and design of the CAU module in the DSP32C. Current interests include CPU and DSP architectures and the design of CAD tools for architectural studies.
Pekarich received his BSEE from Monmouth College and his MS in computer science from Stevens Institute of Technology, both in New Jersey.
He is a member of Eta Kappa Nu, Sigma Pi Sigma, and ACM, and an associate member of the IEEE Computer Society.

Kreg D. Ulery is a member of the DSP Product Management Group in AT&T Microelectronics. He has worked in product engineering, application engineering, and technical support for the DSP product family. Prior to this assignment, he worked at Bell Laboratories, where he participated in digital and analog design projects in the DSP Design Department.
Ulery received a BSc degree from Pennsylvania's Messiah College and an MSEE degree from Rutgers University in New Jersey.

Joan M. Huser is a member of the technical staff working in the DSP Design Group. She has been involved in a variety of digital signal processing areas, including VLSI design, hardware development system design, applications programming, and technical support.
Huser received BS degrees in electrical engineering and computer science from Washington University in St. Louis, Missouri, and an MS degree in electrical engineering from Stanford University.

Questions concerning this article can be directed to Renato N. Gadenz, AT&T Bell Laboratories, Rm. 2D-235, Holmdel, NJ 07733.
