
EKT353 Lecture Notes by Professor Dr. Farid Ghani

DSP Hardware Introduction: Since their introduction in the early 1980s, DSP processors have grown substantially in complexity and sophistication, enhancing their capability and range of applicability. This has also led to a substantial increase in the number of DSP processors available. To reflect this, the features of successive generations of fixed- and floating-point DSP processors, and the factors that affect the choice of a DSP processor, are considered in the following pages. For convenience, DSP processors can be divided into two broad categories: general purpose and special purpose. General-purpose DSP processors include fixed-point devices such as the Texas Instruments TMS320C54x and Motorola DSP563x processors, and floating-point processors such as the Texas Instruments TMS320C4x and Analog Devices ADSP21xxx SHARC processors. There are two types of special-purpose hardware:

1. Hardware designed for efficient execution of specific DSP algorithms, such as digital filters or the Fast Fourier Transform. This type of special-purpose hardware is sometimes called an algorithm-specific digital signal processor.
2. Hardware designed for specific applications, for example telecommunications, digital audio, or control applications. This type of hardware is sometimes called an application-specific digital signal processor.

In most cases application-specific digital signal processors execute specific algorithms, such as PCM encoding/decoding, but are also required to perform other application-specific operations. Examples of special-purpose DSP processors are Cirrus's processor for digital audio sampling rate converters (CS8420), Mitel's multi-channel telephony voice echo canceller (MT9300), the FFT processor (PDSP16515A) and the programmable FIR filter (PDSP16256). Both general-purpose and special-purpose processors can be designed with single chips or with individual blocks of multipliers, ALUs, memories, and so on. First, we will discuss the architectural features of digital signal processors that have made real-time DSP possible in many areas. Most general-purpose processors available today are based on the von Neumann concept, where operations are performed sequentially. Figure 1 shows a simplified architecture for a standard von Neumann processor. When an instruction is processed in such a processor, the units of the processor not involved at each instruction phase wait idly until control is passed to them.
Figure 1. A simplified architecture for a standard microprocessor: an address generator, ALU, accumulator, product register, multiplier and I/O devices share a single program/data memory over a common address bus and data bus.


Increases in processor speed are achieved by making the individual units operate faster, but there is a limit on how fast they can be made to operate. To operate in real time, a DSP processor must have its architecture optimized for executing DSP functions.
Figure 2. Basic generic hardware architecture for signal processing: an arithmetic unit (ALU, multiplier, accumulator, shifter) and I/O devices, with a memory unit comprising X data, Y data and program memories, interconnected by X, Y and P data buses.

Figure 2 shows a generic hardware architecture suitable for real-time DSP. It is characterized by the following:

1. A multiple bus structure with separate memory spaces for data and program instructions. Typically the data memories hold input data, intermediate values and output samples, as well as fixed coefficients (for example, for digital filters or FFTs). The program instructions are stored in the program memory.
2. An I/O port, which provides a means of passing data to and from external devices such as the ADC and DAC, or of passing digital data to other processors. Direct memory access (DMA), if available, allows rapid transfer of blocks of data directly to or from data RAM, typically under external control.
3. Arithmetic units for logical and arithmetic operations, which include an ALU, a hardware multiplier and shifters (or a multiplier-accumulator).

Why is such an architecture necessary? Most DSP algorithms (such as filtering, correlation and the fast Fourier transform) involve repetitive arithmetic operations (multiply, add and memory accesses) and heavy data flow through the CPU. The architecture of standard microprocessors is not suited to this type of activity. An important goal in DSP hardware design is to optimize both the hardware architecture and the instruction set for DSP operations. In digital signal processors this is achieved by making extensive use of parallelism. In particular, the following techniques are used:

1. Harvard architecture;
2. pipelining;
3. fast, dedicated hardware multiplier-accumulator;
4. special instructions dedicated to DSP;
5. replication;
6. on-chip memory/cache;
7. extended parallelism: SIMD, VLIW and static superscalar processing.

For successful DSP design, it is important to understand these key architectural features.


Harvard architecture: The principal feature of the Harvard architecture is that the program and data memories lie in two separate spaces, permitting a full overlap of instruction fetch and execution. Standard microprocessors, such as the 6502, are characterized by a single bus structure for both data and instructions, as shown in Figure 1. Suppose that in a standard microprocessor we wish to read a value op1 at address ADR1 in memory into the accumulator and then store it at two other addresses, ADR2 and ADR3. The instructions could be

LDA ADR1    load the operand op1 into the accumulator from address ADR1
STA ADR2    store op1 in address ADR2
STA ADR3    store op1 in address ADR3

Typically, each of these instructions would involve three distinct steps: instruction fetch, instruction decode, and instruction execute. Here, the instruction fetch involves fetching the next instruction from memory, and instruction execute involves either reading data from or writing data into memory. In a standard processor without Harvard architecture, the program instructions (that is, the program code) and the data (operands) are held in one memory space; see Figure 3. Thus the next instruction cannot be fetched while the current one is executing, because the fetch and execute phases each require a memory access.


Figure 3. An illustration of instruction fetch, decode, and execute in a non-Harvard architecture with a single memory space: (a) the MPU (with instruction register IR and program counter PC) fetches LDA ADR1, STA ADR2 and STA ADR3 from the same memory that holds the operands at ADR1, ADR2 and ADR3; (b) timing diagram, showing the strictly sequential fetch, decode and execute of each instruction.


In a Harvard architecture (Figure 4), since the program instructions and data lie in separate memory spaces, the fetching of the next instruction can overlap the execution of the current instruction; see Figure 5. Normally, the program memory holds the program code, while the data memory stores variables such as the input data samples.
Figure 4. Basic Harvard architecture with separate data and program memory spaces: the digital signal processor connects to the program memory over its own address and data buses, and to the data memory over separate address and data buses.

It may be seen from Figure 4 that data and program instruction fetches can be overlapped, as two independent memories are used in the architecture. This is explained with the help of the timing diagram shown in Figure 5.


Figure 5. An illustration of the instruction overlap made possible by the Harvard architecture: while LDA ADR1 is being decoded, STA ADR2 is fetched; while LDA ADR1 executes, STA ADR2 is decoded and STA ADR3 fetched.

Strict Harvard architecture is used by some digital signal processors (for example, the Motorola DSP56000), but most use a modified Harvard architecture (for example, the TMS320 family of processors). In the modified architecture used by the TMS320, separate program and data memory spaces are still maintained, but communication between the two memory spaces is permissible, unlike in the strict Harvard architecture.

Pipelining: Pipelining is a technique which allows two or more operations to overlap during execution. In pipelining, a task is broken down into a number of distinct subtasks which are then overlapped during execution. It is used extensively in digital signal processors to increase speed. A pipeline is akin to a typical production line in a factory, such as a car or television assembly plant. As in the production line, the task is broken down into small, independent subtasks called pipe stages. The pipe stages are connected in series to form a pipe, and the stages are executed sequentially. As we have seen in the last example, an instruction can be broken down into three steps. Each step in the instruction can be regarded as a stage in a pipeline and so can be overlapped. By overlapping the instructions, a new instruction is started at the start of each clock cycle, as shown in Figure 6(a).
Figure 6(a). Instruction pipelining: instructions 1, 2 and 3 each pass through pipe stages 1, 2 and 3, with a new instruction entering the pipe at each clock cycle.

Figure 6(b) gives the timing diagram for a three-stage pipeline, drawn to highlight the instruction steps. Typically, each step in the pipeline takes one machine cycle.

Figure 6(b). Timing diagram for a three-stage pipeline: at clock cycle i the processor fetches instruction i, decodes instruction i-1 and executes instruction i-2.

Thus during a given cycle up to three different instructions may be active at the same time, although each will be at a different stage of completion. The key to an instruction pipeline is that the three parts of the instruction (that is, fetch, decode and execute) are independent, so the execution of multiple instructions can be overlapped. In Figure 6(b) it is seen that, at the ith cycle, the processor could be simultaneously fetching the ith instruction, decoding the (i-1)th instruction and executing the (i-2)th instruction. The three-stage pipelining discussed above is based on the technique used in the Texas Instruments TMS320 processors. As in other applications of pipelining, in the TMS320 a number of registers are used to achieve the pipeline: a pre-fetch counter holds the address of the next instruction to be fetched, an instruction register holds the instruction to be executed, and a queue instruction register stores the instructions to be executed if the current instruction is still executing. The program counter contains the address of the next instruction to execute. By exploiting the inherent parallelism in the instruction stream, pipelining leads to a significant reduction, on average, in the execution time per instruction. The throughput of a pipeline
machine is determined by the number of instructions through the pipe per unit time. As in a production line, all the stages in the pipeline must be synchronized. The time for moving an instruction from one stage to the next within the pipe (see Figure 6(a)) is one cycle and depends on the slowest stage in the pipeline. In a perfect pipeline, the average time per instruction is given by

average time per instruction = time per instruction (non-pipelined) / number of pipe stages    (1)

In the ideal case, the speed increase is equal to the number of pipe stages. In practice, the speed increase will be less because of the overheads in setting up the pipeline, delays in the pipeline registers, and so on.

Example 1: In a non-pipelined machine, the instruction fetch, decode, and execute take 35 ns, 25 ns, and 40 ns, respectively. Determine the increase in throughput if the instruction steps were pipelined. Assume a 5 ns pipeline overhead at each stage, and ignore other delays.

Solution: In the non-pipelined machine, the average instruction time is simply the sum of the execution times of all the steps: 35 + 25 + 40 = 100 ns. However, if we assume that the processor has a fixed machine cycle, with the instruction steps synchronized to the system clock, then each instruction takes three machine cycles to complete: 3 x 40 ns = 120 ns, since the slowest step takes 40 ns. This corresponds to a throughput of 8.3 x 10^6 instructions per second. In the pipelined machine, the clock speed is determined by the speed of the slowest stage plus the overhead. In our case, the machine cycle is 40 + 5 = 45 ns. This places a limit on the average instruction execution time. The throughput (when the pipeline is full) is 22.2 x 10^6 instructions per second. Then

speed-up = average instruction time (non-pipelined) / average instruction time (pipelined) = 120/45 = 2.67 times

(assuming the non-pipelined machine executes each instruction in three cycles). In the pipelined machine, each instruction still takes three clock cycles, but at each cycle the processor is executing up to three different instructions. Pipelining increases the system throughput, but not the execution time of each instruction on its own; typically there is a slight increase in the execution time of each instruction because of the pipeline overhead. Pipelining also has a major impact on the system memory: the number of memory accesses in a pipelined machine increases, essentially by the number of stages. In DSP, the use of the Harvard architecture, where data and instructions lie in separate memory spaces, promotes pipelining. When a slow unit, such as a data memory, and an arithmetic element are connected in series, the arithmetic unit often waits idly for data for a good deal of the time. Pipelining may be used in such cases to allow a better utilization of the arithmetic unit. The next example illustrates this concept.

Example 2: Most DSP algorithms are characterized by multiply-and-accumulate operations typified by the following equation:


y(n) = a0x(n) + a1x(n-1) + a2x(n-2) + ... + aN-1x(n-(N-1))

Figure 7 shows a non-pipelined configuration of an arithmetic element for executing the above equation. Assume transport delays of 200 ns, 100 ns, and 100 ns, respectively, for the memory, multiplier and accumulator.
Figure 7. Non-pipelined MAC configuration: the coefficient memory (a0, a1, ..., aN-1) and the data memory (x(n), x(n-1), ..., x(n-(N-1))) feed a multiplier followed by an accumulator in series. Products are clocked into the accumulator every 400 ns.

1. What is the system throughput?
2. Reconfigure the system with pipelining to give a speed increase of 2:1. Illustrate the operation of the new configuration with a timing diagram.

Solution:

1. The coefficients and the data arrays are stored in memory as shown in Figure 7. In the non-pipelined mode, the coefficients and data are accessed sequentially and applied to the multiplier. The products are summed in the accumulator. Successive multiply-accumulate (MAC) operations are performed once every 400 ns (200 + 100 + 100), giving a throughput of 2.5 x 10^6 operations per second.
2. The arithmetic operations involved can be broken up into three distinct steps: memory read, multiply, and accumulate. To improve speed, these steps can be overlapped. A speed improvement of 2:1 can be achieved by inserting pipeline registers between the memory and the multiplier, and between the multiplier and the accumulator, as shown in Figure 8.


Figure 8. Pipelined MAC configuration: pipeline registers are inserted between the coefficient/data memories and the multiplier, and a product register between the multiplier and the accumulator. The pipeline registers serve as temporary stores for a coefficient and data sample pair; the product register serves as a temporary store for the product.

The timing diagram for the pipeline configuration is shown in Figure 9. As is evident in the timing diagram, the MAC is performed once every 200 ns. The limiting factor is the basic transport delay through the slowest element, in this case the memory. Pipeline overheads have been ignored.
Figure 9. Timing diagram for a pipelined MAC unit: in successive clock cycles the unit reads x(0); multiplies a0x(0) while reading x(1); accumulates 0 + a0x(0) while multiplying a1x(1) and reading x(2); and so on, so that the 1st, 2nd and 3rd MAC operations overlap. When the pipeline is full, a MAC operation is performed every clock cycle (200 ns).

DSP algorithms are often repetitive but highly structured, making them well suited to multilevel pipelining. For example, the FFT requires the continuous calculation of butterflies. Although each butterfly requires different data and coefficients, the basic butterfly arithmetic operations are identical. Thus arithmetic units such as FFT processors can be tailored to take advantage of this. Pipelining ensures a steady flow of instructions to the CPU and in general leads to a significant increase in system throughput.

However, on occasion pipelining may cause problems. For example, in some digital signal processors, pipelining may cause an unwanted instruction to be executed, especially near branch instructions, and the designer should be aware of this possibility.

Hardware multiplier-accumulator: The basic numerical operations in DSP are multiplications and additions. Multiplication, in software, is notoriously time consuming. Additions are even more time consuming if floating-point arithmetic is used. To make real-time DSP possible, a fast, dedicated hardware multiplier-accumulator (MAC), using fixed- or floating-point arithmetic, is mandatory. A hardware MAC is now standard in all digital signal processors. In a fixed-point processor, the hardware multiplier typically accepts two 16-bit 2's complement fractional numbers and computes a 32-bit product in a single cycle (25 ns, typically). The average MAC instruction time can be significantly reduced through the use of special repeat instructions.


A typical DSP hardware MAC configuration is depicted in Figure 10. In this configuration the multiplier has a pair of input registers that hold the inputs to the multiplier, and a 32-bit product register which holds the result of a multiplication. The output of the P (product) register is connected to a double-precision accumulator, where the products are accumulated.

Figure 10. A typical MAC configuration in DSPs: 16-bit X and Y input registers feed the multiplier, whose result is held in a 32-bit P (product) register; the product is then summed into the accumulator (R register).

The principle is very much the same for hardware floating-point multiplier-accumulators, except that the inputs and products are normalized floating-point numbers. Floating-point MACs allow fast computation of DSP results with minimal errors. DSP algorithms such as FIR and IIR filtering suffer from the effects of finite word length (coefficient quantization and arithmetic errors). Floating point offers a wide dynamic range and reduced arithmetic errors, although for many applications the dynamic range provided by the fixed-point representation is adequate.

General-Purpose Digital Signal Processors: General-purpose digital signal processors are basically high-speed microprocessors with hardware architectures and instruction sets optimized for DSP operations. These processors make extensive use of parallelism, Harvard architecture, pipelining and dedicated hardware wherever possible to perform time-consuming operations, such as shifting/scaling, multiplication, and so on. General-purpose DSPs have evolved substantially over the last decade as a result of the never-ending quest to find better ways to perform DSP operations, in terms of computational efficiency, ease of implementation, cost, power consumption, size, and application-specific needs. The insatiable appetite for improved computational efficiency has led to substantial reductions in instruction cycle times and, more importantly, to increasing sophistication in hardware and software architectures. It is now common to have dedicated on-chip arithmetic hardware units (for example, to support fast multiply-accumulate operations), large on-chip memory with multiple access, and special instructions for efficient execution of inner-core DSP computations. There is also a trend towards increased data word sizes (for example, to maintain signal quality) and increased parallelism (to increase both the number of instructions executed per cycle and the number of operations performed per instruction). Thus, in newer general-purpose DSP processors, increasing use is made of multiple data paths and arithmetic units to support parallel operations. DSP processors based on SIMD (Single Instruction, Multiple Data), VLIW (Very Long Instruction Word) and superscalar architectures are being introduced to support efficient parallel processing. In some DSPs, performance is enhanced further by using specialized on-chip coprocessors to speed up specific DSP algorithms such as FIR filtering and Viterbi decoding. The explosive growth in communications and digital audio technologies has had a major influence on the evolution of DSPs, as has the growth in embedded DSP processor applications.

Fixed-Point Digital Signal Processors: Fixed-point DSP processors available today differ in their detailed architecture and the on-board resources provided. A summary of the key architectures of four generations of fixed-point DSP processors from four leading semiconductor manufacturers is given in Table 1. The classification of DSP processors into four generations is based partly on historical reasons, architectural features, and computational performance. The basic architecture of the first generation fixed-point DSP processor family (TMS320C1x), first introduced in 1982 by Texas Instruments, is depicted in Figure 11.


Figure 11. A simplified architecture of a first generation fixed-point DSP processor (Texas Instruments TMS320C10): program memory and data memory on a 16-bit program memory bus and a 16-bit data bus, with input registers feeding a 16 x 16-bit multiplier, a 32-bit ALU and a 32-bit accumulator.

Key features of the TMS320C1x are the dedicated arithmetic units, which include a multiplier and an accumulator. The processor family has a modified Harvard architecture with two separate memory spaces for programs and data. It has an on-chip memory

and special instructions for the execution of basic DSP algorithms, although these are limited. Second generation fixed-point DSPs have substantially enhanced features compared to the first generation. In most cases, these include much larger on-chip memories and more special instructions to support efficient execution of DSP algorithms. As a result, the computational performance of second generation DSP processors is four to six times that of the first generation. Typical second generation DSP processors include the Texas Instruments TMS320C5x, Motorola DSP5600x, Analog Devices ADSP21xx and Lucent Technologies DSP16xx families. Texas Instruments' first and second generation DSPs have a lot in common architecturally, but the second generation DSPs have more features and increased speed. The internal architecture that typifies the TMS320C5x family of processors is shown in Figure 12, in a simplified form, to emphasize the dual internal memory spaces which are characteristic of the Harvard architecture.


Figure 12. A simplified architecture of a second generation fixed-point DSP (Texas Instruments TMS320C50): 2K program ROM, 9.5K program/data RAM and 0.5K data RAM on a 16-bit program data bus and a 16-bit data bus (with a 16-bit external data bus), and arithmetic units comprising a 16 x 16 multiplier with T and P registers, an ALU, an accumulator and shifters.

Special instructions for DSP operations include a multiply-and-accumulate-with-data-move instruction which, for example, can be combined with a repeat instruction to execute an FIR filter with considerable time savings. Its bit-reversed addressing capability is useful in FFTs. Unlike the first generation fixed-point processor family (C1x), which has a very limited internal memory, the C5x provides more on-chip memory.


The Motorola DSP5600x processor is a high-precision fixed-point digital signal processor. Its architecture is depicted in Figure 13.
Figure 13. A simplified architecture of a second generation fixed-point DSP (Motorola DSP56002): program ROM/RAM and X and Y data memories on 24-bit X, Y and global data buses, multiplexed through a data bus switch onto a single 24-bit external data bus, with arithmetic units comprising a 24 x 24/56-bit MAC and two 56-bit accumulators.

Internally, it has two independent data memory spaces, the X-data and Y-data memory spaces, and one program memory space. Having two separate data memory spaces allows a natural partitioning of data for DSP operations and facilitates the

execution of the algorithm. For example, in graphics applications data can be stored as X and Y data, in FIR filtering as coefficients and data, and in the FFT as real and imaginary components. During program execution, pairs of data samples can be fetched from or stored in internal memory simultaneously in one cycle. Externally, the two data spaces are multiplexed onto a single data bus, reducing somewhat the benefits of the dual internal data memory. The arithmetic units consist of two 56-bit accumulators and a single-cycle, fixed-point hardware multiplier-accumulator (MAC). The MAC accepts 24-bit inputs and produces a 48-bit product, which is accumulated to 56 bits. The 24-bit word length provides sufficient accuracy for representing most DSP variables, while the 56-bit accumulator (including eight guard bits) prevents arithmetic overflows. These word lengths are adequate for most applications, including digital audio, which imposes stringent requirements. The DSP5600x processors provide special instructions that allow zero-overhead looping, and a bit-reversed addressing capability for scrambling the input data before an FFT or unscrambling the transformed data afterwards. The Analog Devices ADSP21xx is another family of second generation fixed-point DSP processors, with two separate external memory spaces: one holds data only, and the other holds program code as well as data. A simplified block diagram of the internal architecture of the ADSP21xx is depicted in Figure 14.


Figure 14. A simplified architecture of a second generation fixed-point DSP (Analog Devices ADSP2100): program memory (24-bit path) and data memory (16-bit path) feeding arithmetic units comprising an ALU, a MAC and a shifter.

The main components are the ALU, multiplier-accumulator, and shifter. The MAC accepts 16-bit inputs and produces a 32-bit product in one cycle. The accumulator of the ADSP21xx has eight guard bits which may be used for extended precision. The ADSP21xx departs from the strict Harvard architecture in that it allows the storage of both data and program instructions in the program memory. A signal line (the data access signal) is used to indicate when data, rather than program instructions, are being fetched from the program memory. Storage of data in the program memory inhibits a steady data flow through the CPU, as data and instruction fetches cannot occur simultaneously. To avoid a bottleneck, the ADSP21xx family has an on-chip program memory cache which holds the last 16 instructions executed. This eliminates the need, especially when executing program loops, for repeated instruction fetches from program memory. The ADSP21xx provides special instructions for zero-overhead looping and

supports a bit-reversed addressing facility for the FFT. The processor family has a large on-chip memory (up to 64 Kbytes of internal RAM is provided for increased data transfer) and excellent support for DMA: external devices can transfer data and instructions to or from the DSP processor RAM without processor intervention. The Lucent Technologies DSP16xx family of fixed-point DSPs (see Figure 15) is targeted at the telecommunications and modem market.
Figure 15. A simplified architecture of the Lucent Technologies DSP16xx fixed-point DSP: program memory with a cache and data memory on 16-bit X and Y data buses, and arithmetic units comprising a 16 x 16-bit multiplier, an ALU and two 36-bit accumulators.

In terms of computational performance, it is one of the most powerful second generation processors. The processor has a Harvard architecture and, like most of the other second generation processors, two data paths, the X and Y data paths. Its arithmetic units include a dedicated 16 x 16-bit multiplier, a 36-bit ALU/shifter (which includes four guard bits) and dual accumulators. Special instructions, such as those for zero-overhead single and block instruction looping, are provided. Third generation fixed-point DSPs are essentially enhancements of second generation DSPs. In general, performance enhancements are achieved by increasing and/or making more effective use of the available on-chip resources. Compared to the second generation, features of the third generation DSPs include more data paths (typically three, compared to two in the second generation), wider data paths, larger on-chip memory and instruction cache and, in some cases, a dual MAC. As a result, the performance of third generation DSPs is typically two or three times that of second generation DSP processors of the same family. Simplified architectures of three third generation DSP processors, the TMS320C54x, DSP563x and DSP16000, are depicted in Figures 16, 17 and 18.


Figure 16. A simplified architecture of a third generation fixed-point DSP (Texas Instruments TMS320C54x): 16K-word program ROM and 8K-word and 24K-word program/data RAMs on multiple data buses (program, C and D data buses), a MAC unit (17 x 17-bit multiplier, 40-bit adder, round/scale logic and 40-bit shifter) and an ALU unit (40-bit ALU, Viterbi accelerator and two 40-bit accumulators).


[Figure: memory units (4K word program cache, 2K word X data RAM, 2K word Y data RAM) connected over the program data bus and the X and Y data buses to the data ALU: a 24 x 24-bit MAC, 2 x 56-bit accumulators and a shifter]

Figure 17. A simplified architecture of a third generation fixed-point DSP (Motorola DSP56300).

[Figure: memory units (program memory and data memory) connected over 32-bit X and Y data buses to the arithmetic unit: two 16 x 16 MACs (one with an ALU, one with an adder) and eight 40-bit accumulators]

Figure 18. A simplified architecture of a third generation fixed-point DSP (Lucent Technologies DSP16000).


Most third generation fixed-point DSP processors are aimed at applications in digital communication and digital audio, reflecting the enormous growth and influence of these application areas on DSP processor development. Thus some processors include features that directly support these applications. In the third generation, semiconductor manufacturers have also taken the issue of power consumption seriously because of the processors' use in portable and handheld devices.

Fourth generation fixed-point processors, with their new architectures, are primarily aimed at large and/or emerging multi-channel applications, such as digital subscriber lines, remote access server modems, wireless base stations, third generation mobile systems and medical imaging. The new fixed-point architecture that has attracted a great deal of attention in the DSP community is the very long instruction word (VLIW) architecture. It makes extensive use of parallelism whilst retaining some of the good features of previous DSP processors. Compared to previous generations, fourth generation fixed-point DSP processors in general have wider instruction words, wider data paths, more registers, a larger instruction cache and multiple arithmetic units, enabling them to execute many more instructions and operations per cycle. The Texas Instruments TMS320C62x family of fixed-point DSP processors is based on the VLIW architecture, as shown in Figure 19.


[Figure: on-chip program RAM and data RAM connected over a 256-bit program data bus and two 32-bit data buses (A and B) to an instruction fetch/dispatch/decode unit and two data paths, each with its own register file and four execution units (L1, S1, M1, D1 and L2, S2, M2, D2)]

Figure 19. A simplified architecture of a fourth generation fixed-point, very long instruction word, DSP processor (Texas Instruments TMS320C62x). Note the two independent arithmetic data paths, each with four execution units: a logic unit (Li), a shifter/logic unit (Si), a multiplier (Mi) and a data address unit (Di), for i = 1, 2.

Typically, the core
processor fetches eight 32-bit instructions at a time, giving an instruction width of 256 bits (hence the term very long instruction word). With a total of eight execution units, four in each data path, the TMS320C62x can execute up to eight instructions in parallel in one cycle. The processor has large program and data cache memories (typically 4 Kbyte of level 1 program/data caches and 64 Kbyte of level 2 program/data cache). Each data path has its own register file (sixteen 32-bit registers), but can also access registers on the other data path.

Advantages of VLIW architectures include simplicity and high computational performance. Disadvantages include increased program memory usage (organizing code to match the inherent parallelism of the processor may lead to inefficient use of memory). Further, optimum processor performance can only be achieved when all the execution units are busy, which is not always possible because of data dependencies, instruction delays and restrictions on the use of the execution units. However, sophisticated programming tools are available for code packing, instruction scheduling, resource assignment and, in general, for exploiting the vast potential of the processor.

Floating-point digital signal processors: The ability of DSP processors to perform high speed, high precision DSP operations using floating-point arithmetic has been a welcome development. It minimizes the finite word length effects inherent in DSP, such as overflows, round-off errors and coefficient quantization errors. It also facilitates algorithm development, as a designer can develop an algorithm on a large computer in a high level language and then port it to a DSP device more readily than with a fixed-point device. Floating-point DSP processors retain key features of fixed-point processors, such as special instructions for DSP operations and multiple data paths for multiple operations. As in the case of fixed-point DSP processors, the floating-point DSP processors available differ significantly in architecture.

The TMS320C3x is perhaps the best known family of first generation general-purpose floating-point DSPs. The C3x family are 32-bit single-chip digital signal processors and support both integer and floating-point arithmetic operations. They have a large memory space and are equipped with many on-chip peripheral facilities to simplify system design. These include a program cache to improve the execution of commonly used code, and on-chip dual-access memories. The large memory space caters for memory-intensive applications, for example graphics and image processing.

In the TMS320C30, a floating-point multiplication requires 32-bit operands and produces a 40-bit normalized floating-point product. Integer multiplication requires 24-bit inputs and yields 32-bit results. Three floating-point formats are supported. The first is a 16-bit short floating-point format, with a 4-bit exponent, 1 sign bit and an 11-bit mantissa; this format is for immediate floating-point operations. The second is a single-precision format with an 8-bit exponent, 1 sign bit and a 23-bit fraction (32 bits in total). The third is a 40-bit extended-precision format, which has an 8-bit exponent, 1 sign bit and a 31-bit fraction. The floating-point representation differs from the IEEE standard, but facilities are provided to allow conversion between the two formats. The TMS320C3x combines features of the Harvard architecture (separate buses for program instructions, data and I/O) and the Von Neumann processor (a unified address space).

The emphasis in the second generation, general-purpose floating-point DSPs is on multiprocessing and multiprocessor support. Key issues in multiprocessor support include inter-processor communication, DMA transfers and global memory sharing.
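The C3x formats above differ from the IEEE standard; for comparison, the standard IEEE 754 single-precision layout (1 sign bit, 8-bit biased exponent, 23-bit fraction) can be unpacked in a few lines of C. The helper name below is illustrative, not a library function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Split an IEEE 754 single-precision float into its three fields:
 * bit 31 = sign, bits 30..23 = biased exponent (bias 127),
 * bits 22..0 = fraction (with an implied leading 1 for normals). */
void ieee754_fields(float f, uint32_t *sign, uint32_t *exponent,
                    uint32_t *fraction)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret bits without aliasing UB */
    *sign     = bits >> 31;
    *exponent = (bits >> 23) & 0xFF;  /* 8-bit biased exponent */
    *fraction = bits & 0x7FFFFF;      /* 23-bit fraction */
}
```

For example, -1.5 is 1.1 (binary) x 2^0, so its sign bit is 1, its biased exponent is 127 and its fraction field is 0x400000.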
The best known second generation floating-point DSP families are the Texas Instruments TMS320C4x and the Analog Devices ADSP-2106x SHARC (Super Harvard Architecture Computer). The C4x

shares some of the architectural features of the C3x, but it was designed for multiprocessing. The C4x family has good I/O capabilities: it has six communication (COMM) ports for inter-processor communication and six 32-bit wide DMA channels for rapid data transfers. The architecture allows multiple operations to be performed in parallel in one instruction cycle. The C4x family supports both floating- and fixed-point arithmetic. The native floating-point data format in the C40 differs from the IEEE 754/854 standard, although conversion between them can be readily accomplished.

The Analog Devices ADSP-2106x SHARC DSP processors are also 32-bit floating-point devices. They have large internal memory and impressive I/O capability: 10 DMA channels allow access to internal memory without processor intervention, and six link ports provide high speed inter-processor communication. The architecture allows shared global memory, making it possible for up to six SHARC processors to access each other's internal RAM at up to the full data rate. The ADSP-2106x family supports both fixed-point and floating-point arithmetic. Its single-precision floating-point format complies with the single-precision IEEE 754/854 floating-point standard (24-bit mantissa and 8-bit exponent). The architecture also supports multiple operations per cycle.

Third generation floating-point DSP processors take the concept of parallelism much further, increasing both the number of instructions and the number of operations per cycle to meet the challenges of multi-channel and computationally intensive applications. This is achieved by the use of new architectures, the VLIW (very long instruction word) and superscalar architectures in particular. The two leading third generation floating-point DSP processor families are the Texas Instruments TMS320C67x and the Analog Devices ADSP-TS001. The TMS320C67x family has the

same VLIW architecture as the advanced fourth generation fixed-point DSP processors, the TMS320C62x.

The TigerSHARC DSP family supports mixed arithmetic types (fixed- and floating-point arithmetic) and data types (8-, 16- and 32-bit numbers). This flexibility makes it possible to use the arithmetic and data type most appropriate for a given application to enhance performance. As with the TMS320C67x, the TigerSHARC is aimed at large-scale, multi-channel applications, such as third generation mobile systems (3G wireless), digital subscriber lines (xDSL) and remote, multiple access server modems for Internet services.

The TigerSHARC, with its static superscalar architecture, combines the good features of the VLIW architecture, conventional DSP architectures and RISC computers. The processor has two computation blocks, each with a multiplier, an ALU and a 64-bit shifter. The processor can execute up to eight MAC operations per cycle with 16-bit inputs and 40-bit accumulation, two 40-bit MACs on 16-bit complex data or two 80-bit MACs with 32-bit data. With 8-bit data, the TigerSHARC can issue up to 16 operations in a cycle. The TigerSHARC has a wide memory bandwidth, with its memory organized in three 128-bit wide banks. Data can be accessed in variable sizes: normal 32-bit words, long 64-bit words or quad 128-bit words. Up to four 32-bit instructions can be issued in one cycle. To avoid the use of large NOPs (a disadvantage of VLIW designs), the large instruction words may be broken down into separate short instructions which are issued to each unit independently.
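The complex MAC on 16-bit data mentioned above can be sketched in portable C (the function below is a model of the arithmetic, not TigerSHARC code): one complex product takes four 16 x 16-bit multiplies, accumulated into wide registers that stand in for the 40-bit hardware accumulators.

```c
#include <assert.h>
#include <stdint.h>

/* One complex multiply-accumulate on 16-bit complex samples:
 * (ar + j*ai) * (br + j*bi), accumulated into 64-bit variables
 * standing in for the wide (40-bit) hardware accumulators. */
void cmac16(int16_t ar, int16_t ai, int16_t br, int16_t bi,
            int64_t *acc_re, int64_t *acc_im)
{
    *acc_re += (int32_t)ar * br - (int32_t)ai * bi;  /* real part */
    *acc_im += (int32_t)ar * bi + (int32_t)ai * br;  /* imaginary part */
}
```

For example, (1 + 2j)(3 + 4j) = -5 + 10j; repeated calls accumulate further products onto the same pair of registers, as in a complex FIR or correlation kernel.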


Selecting digital signal processors: The choice of a DSP processor for a given application has become an important issue in recent years because of the wide range of processors available (Levy, 1999; Berkeley Design Technology, 1996, 1999). Specific factors that may be considered when selecting a DSP processor for an application include architectural features, execution speed, type of arithmetic and word length.

1. Architectural features. Most DSP processors available today have good architectural features, but these may not be adequate for a specific application. Key features of interest include the size of on-chip memory, special instructions and I/O capability. On-chip memory is an essential requirement in most real-time DSP applications for fast access to data and rapid program execution. For memory-hungry applications (e.g. digital audio, FAX/modem, MPEG coding/decoding), the size of internal RAM may become an important distinguishing factor. Where internal memory is insufficient, it can be augmented by high speed off-chip memory, although this may add to system cost. For applications that require fast and efficient communication or data flow with the outside world, I/O features such as interfaces to ADCs and DACs, DMA capability and support for multiprocessing may be important. Depending on the application, a rich set of special instructions to support DSP operations is important, e.g. zero-overhead looping capability, dedicated DSP instructions and circular addressing.
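Circular addressing, one of the special instructions listed above, keeps an FIR delay line in a fixed buffer by wrapping the index when it runs off the end. A minimal C model is sketched below (illustrative only; hardware circular addressing performs the wrap without the explicit modulo step).

```c
#include <assert.h>

#define TAPS 4

/* FIR filter state using an explicitly managed circular delay line.
 * A DSP with circular addressing performs the index wrap in hardware;
 * here the wrap is the explicit "% TAPS" step. */
typedef struct {
    float delay[TAPS];   /* circular delay line holding recent inputs */
    int   head;          /* index of the newest sample */
} fir_t;

float fir_step(fir_t *f, const float *coeff, float x)
{
    f->head = (f->head + 1) % TAPS;      /* advance write pointer with wrap */
    f->delay[f->head] = x;               /* store newest sample x[n] */
    float y = 0.0f;
    for (int k = 0; k < TAPS; k++)       /* y[n] = sum_k coeff[k] * x[n-k] */
        y += coeff[k] * f->delay[(f->head - k + TAPS) % TAPS];
    return y;
}
```

As a check, a 4-tap moving average (all coefficients 0.25) fed a constant input converges to that constant once the delay line has filled.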


2. Execution speed. The speed of DSP processors is an important measure of performance because of the time-critical nature of most DSP tasks. Traditionally, the two main units of measurement for this are the clock speed of the processor, in MHz, and the number of instructions performed, in millions of instructions per second (MIPS) or, in the case of floating-point DSP processors, in millions of floating-point operations per second (MFLOPS). However, such measures may be inappropriate in some cases because of significant differences in the way different DSP processors operate, with most able to perform multiple operations in one machine instruction. For example, the C62x family of processors can execute as many as eight instructions in a cycle. The number of operations performed in each cycle also differs from processor to processor. Thus, comparison of the execution speed of processors based on such measures may not be meaningful. An alternative measure is based on the execution speed of benchmark algorithms, e.g. DSP kernels such as the FFT and FIR and IIR filters (Levy, 1998; Berkeley Design Technology, 1999).

3. Type of arithmetic. The two most common types of arithmetic used in modern DSP processors are fixed- and floating-point arithmetic. Floating-point arithmetic is the natural choice for applications with wide and variable dynamic range requirements (dynamic range may be defined as the difference between the largest and smallest signal levels that can be represented, or the difference between the largest signal and the noise floor, measured in decibels). Fixed-point processors are favored in low cost, high volume applications (e.g. cellular phones and computer disk drives). The use of fixed-point arithmetic

raises issues associated with dynamic range constraints which the designer must address. In general, floating-point processors are more expensive than fixed-point processors, although the cost difference has fallen significantly in recent years. Most floating-point DSP processors available today also support fixed-point arithmetic.

4. Word length. The processor data word length is an important parameter in DSP, as it can have a significant impact on signal quality: it determines how accurately parameters and results of DSP operations can be represented. In general, the longer the data word, the lower the errors introduced by digital signal processing. In fixed-point audio processing, for example, a processor word length of at least 24 bits is required to keep the smallest signal level sufficiently above the noise floor generated by signal processing to maintain CD quality. A variety of word lengths are used in fixed-point DSP processors, depending on the application. Fixed-point DSP processors aimed at telecommunications markets tend to use a 16-bit word length (e.g. the TMS320C54x), whereas those aimed at high quality audio applications tend to use 24 bits (e.g. the DSP56300). In recent years there has been a trend towards the use of more bits in ADCs and DACs (e.g. the Cirrus 24-bit audio codec, CS4228) as the cost of these devices falls to meet the insatiable demand for increased quality. Thus, there is likely to be an increased demand for larger processor word lengths for audio processing.

In fixed-point processors, it may also be necessary to provide guard bits (typically 1 to 8 bits) in the accumulators to prevent arithmetic overflows during extended multiply and accumulate operations. The extra bits effectively extend the dynamic range available in the DSP processor. In most floating-point DSP processors, a 32-bit

data size (24-bit mantissa and 8-bit exponent) is used for single-precision arithmetic. This size is also compatible with the IEEE floating-point format (IEEE 754). Most floating-point DSP processors also have fixed-point arithmetic capability, and often support variable data size fixed-point arithmetic.
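The decibel figures behind these word-length choices follow from the standard quantization rule of thumb: each bit contributes roughly 6.02 dB, and a full-scale sine input adds about 1.76 dB. The function below is a sketch of that textbook formula, not a measurement of any particular device.

```c
#include <assert.h>
#include <math.h>

/* Rule-of-thumb signal-to-quantization-noise ratio of an N-bit
 * quantizer driven by a full-scale sine wave:
 *     SNR ~= 6.02*N + 1.76 dB.
 * Each extra bit of word length buys about 6 dB of signal quality. */
double quantization_snr_db(int bits)
{
    return 6.02 * bits + 1.76;
}
```

By this estimate a 16-bit word gives about 98 dB and a 24-bit word about 146 dB, which is consistent with 24-bit processors dominating high quality audio while 16 bits suffices for telecommunications.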


TMS320C6416 DSP Board



Functional block and DSP core diagram for TMS320C6416 DSP
