ECT303 M5 Ktunotes - in

COMPUTER ARCHITECTURE
12.1 Introduction
For convenience, DSP processors can be divided into two broad categories: general
purpose and special purpose. General-purpose DSP processors include fixed-point
devices such as Texas Instruments TMS320C54x and Motorola DSP563x processors, and
floating-point processors such as Texas Instruments TMS320C4x and Analog Devices
ADSP21xxx SHARC processors.
There are two types of special-purpose hardware.
(1) Hardware designed for efficient execution of specific DSP algorithms such as
digital filters and the fast Fourier transform. This type of special-purpose hardware is
sometimes called an algorithm-specific digital signal processor.
(2) Hardware designed for specific applications, for example telecommunications,
digital audio, or control applications. This type of hardware is sometimes
called an application-specific digital signal processor.
In most cases application-specific digital signal processors execute specific
algorithms, such as PCM encoding/decoding, but are also required to perform other
application-specific operations. Examples of special-purpose DSP processors are
Cirrus's processor for digital audio sampling rate converters (CS8420), Mitel's
multi-channel telephony voice echo canceller (MT9300), FFT processor (PDSP16515A)
and programmable FIR filter (VPDSP16256).
Both general-purpose and special-purpose processors can be designed with single
chips or with individual blocks of multipliers, ALUs, memories, and so on.
First, we will discuss the architectural features of digital signal processors that have
made real-time DSP in many areas possible.
12.2 Computer architectures for signal processing
Most general purpose processors available today are based on the von Neumann
concepts, where operations are performed sequentially. Figure 12.1 shows a simplified
architecture for a standard von Neumann processor. When an instruction is
processed in such a processor, units of the processor not involved at each instruction
phase wait idly until control is passed on to them. Increase in processor speed is
achieved by making the individual units operate faster, but there is a limit on how fast
they can be made to operate.

Figure 12.1 A simplified architecture for standard microprocessors. (Blocks shown:
address generator, ALU, multiplier (optional), accumulator, product register, I/O
devices, and a single program and data memory, linked by an address bus and a data bus.)

Figure 12.2 Basic generic hardware architecture for signal processing. (Blocks shown:
arithmetic units with multiplier, accumulator and shifter; memory units with X data
memory, Y data memory and program memory; I/O devices; linked by the X data, Y data
and P data buses.)
If it is to operate in real time, a DSP processor must have its architecture optimized
for executing DSP functions. Figure 12.2 shows a generic hardware architecture
suitable for real-time DSP. It is characterized by the following:
◼ Multiple bus structure with separate memory space for data and program
instructions. Typically the data memories hold input data, intermediate data
values and output samples, as well as fixed coefficients for, for example, digital
filters or FFTs. The program instructions are stored in the program memory.
◼ The I/O port provides a means of passing data to and from external devices such
as the ADC and DAC or for passing digital data to other processors. Direct
memory access (DMA), if available, allows for rapid transfer of blocks of data
directly to or from data RAM, typically under external control.
◼ Arithmetic units for logical and arithmetic operations, which include an ALU,
a hardware multiplier and shifters (or multiplier-accumulator).
Why is such an architecture necessary? Most DSP algorithms (such as filtering,
correlation and fast Fourier transform) involve repetitive arithmetic operations such
as multiply, add, memory accesses, and heavy data flow through the CPU. The
architecture of standard microprocessors is not suited to this type of activity. An
important goal in DSP hardware design is to optimize both the hardware architecture
and the instruction set for DSP operations. In digital signal processors, this is achieved
by making extensive use of the concepts of parallelism. In particular, the following
techniques are used:
◼ Harvard architecture;
◼ pipelining;
◼ fast, dedicated hardware multiplier/accumulator;
◼ special instructions dedicated to DSP;
◼ replication;
◼ on-chip memory/cache;
◼ extended parallelism - SIMD, VLIW and static superscalar processing.
For successful DSP design, it is important to understand these key architectural
features.
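
To make the "repetitive multiply, add, memory access" pattern concrete, here is a minimal Python sketch of a direct-form FIR filter. The function name `fir_filter` and the sample and coefficient values are illustrative assumptions, not taken from the text. Each output sample costs one multiply-accumulate and two memory reads per coefficient, which is the kind of workload the techniques listed above are meant to speed up.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * samples[n - k].

    Returns the output samples and the number of multiply-accumulate
    (MAC) operations performed. Samples before n = 0 are taken as zero.
    """
    outputs = []
    mac_count = 0
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(coeffs):
            if n - k >= 0:
                acc += h * samples[n - k]   # one multiply and one add
                mac_count += 1
        outputs.append(acc)
    return outputs, mac_count


# Illustrative two-tap averaging filter (values are assumptions).
y, macs = fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
print(y)     # [0.5, 1.5, 2.5, 3.5]
print(macs)  # 7
```

Even this four-sample, two-tap case needs seven MACs; the count grows with the product of the signal length and the filter length, so the cost of a single MAC dominates the run time.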
12.2.1 Harvard architecture

The principal feature of the Harvard architecture is that the program and data
memories lie in two separate spaces, permitting a full overlap of instruction fetch and
execution. Standard microprocessors, such as the MOS Technology 6502, are
characterized by a single bus structure for both data and instructions, as shown in
Figure 12.1.
Suppose that in a standard microprocessor we wish to read a value op1 at address
ADR1 in memory into the accumulator and then store it at two other addresses, ADR2
and ADR3. The instructions could be
LDA ADR1    load the operand op1 into the accumulator from ADR1
STA ADR2    store op1 in address ADR2
STA ADR3    store op1 in address ADR3

Figure 12.3 An illustration of instruction fetch, decode and execute in a
non-Harvard architecture with single memory space: (a) instruction fetch from
memory; (b) timing diagram.
Typically, each of these instructions would involve three distinct steps:
◼ instruction fetch;
◼ instruction decode;
◼ instruction execute.
In our case, the instruction fetch involves fetching the next instruction from memory,
and instruction execute involves either reading or writing data into memory. In a
standard processor, without Harvard architecture, the program instructions (that is, the
program code) and the data (operands) are held in one memory space; see Figure 12.3.
Thus the fetching of the next instruction while the current one is executing is not
allowed, because the fetch and execution phases each require memory access.
In a Harvard architecture (Figure 12.4), since the program instructions and data lie
in separate memory spaces, the fetching of the next instruction can overlap the
execution of the current instruction; see Figure 12.5. Normally, the program memory
holds the program code, while the data memory stores variables such as the input data
samples.
Figure 12.4 Basic Harvard architecture with separate data and program memory spaces.
Data and program instruction fetches can be overlapped as two independent
memories are used. (Blocks shown: digital signal processor, program memory and data
memory, linked by the program memory address bus, program data bus, data memory
address bus and data bus.)

Figure 12.5 An illustration of instruction overlap made possible by the Harvard
architecture.
Strict Harvard architecture is used by some digital signal processors (for example,
Motorola DSP56000), but most use a modified Harvard architecture (for example, the
TMS320 family of processors). In the modified architecture used by the TMS320,
for example, separate program and data memory spaces are still maintained, but
communication between the two memory spaces is permissible, unlike in the strict
Harvard architecture.
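
The effect of the shared memory can be illustrated with a small cycle-count model in Python. This is a simplified sketch under stated assumptions, not a model of any real processor: every stage takes one cycle, only fetch and execute touch memory, instructions issue in order, and the function name `cycles_to_finish` is hypothetical. With a single memory, a fetch can never coincide with an execute; with separate program and data memories it can.

```python
def cycles_to_finish(n_instructions, shared_memory):
    """Count cycles needed to run n fetch/decode/execute instructions.

    Assumptions (illustrative only): one cycle per stage, fetch and
    execute each need one memory access, decode needs none.
    """
    busy = set()        # cycles in which the single shared memory is in use
    prev_fetch = -1
    finish = 0
    for _ in range(n_instructions):
        fetch = prev_fetch + 1
        if shared_memory:
            while fetch in busy:        # wait until the memory is free
                fetch += 1
        execute = fetch + 2             # decode occupies cycle fetch + 1
        if shared_memory:
            while execute in busy:
                execute += 1
            busy.update((fetch, execute))
        prev_fetch = fetch
        finish = execute + 1
    return finish


# The LDA / STA / STA sequence from the text: three instructions.
print(cycles_to_finish(3, shared_memory=True))   # 7
print(cycles_to_finish(3, shared_memory=False))  # 5
```

Under these assumptions the three-instruction sequence takes 7 cycles with one memory and 5 with separate memories, against 9 if nothing is overlapped at all.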
12.2.2 Pipelining
Pipelining is a technique which allows two or more operations to overlap during
execution. In pipelining, a task is broken down into a number of distinct subtasks
which are then overlapped during execution. It is used extensively in digital signal
processors to increase speed. A pipeline is akin to a typical production line in a
factory, such as a car or television assembly plant. As in the production line, the task
is broken down into small, independent subtasks called pipe stages. The pipe stages
are connected in series to form a pipe and the stages executed sequentially.

Figure 12.6 An illustration of the concept of pipelining. (Part (a) shows instructions
1 to 3 each passing through pipe stages 1 to 3; part (b) shows the instruction fetch,
instruction decode and instruction execute rows against instruction numbers i - 1, i,
i + 1 and i + 2.)

As we have seen in the last section, an instruction can be broken down into three
steps. Each step in the instruction can be regarded as a stage in a pipeline and so can
be overlapped. By overlapping the instructions, a new instruction is started at the start
of each clock cycle (Figure 12.6(a)).
Figure 12.6(b) gives the timing diagram for a three-stage pipeline, drawn to
highlight the instruction steps. Typically, each step in the pipeline takes one machine
cycle. Thus during a given cycle up to three different instructions may be active at the
same time, although each will be at a different stage of completion. The key to an
instruction pipeline is that the three parts of the instruction (that is, fetch, decode
and execute) are independent and so the execution of multiple instructions can be
overlapped. In Figure 12.6(b), it is seen that, at the ith cycle, the processor could
be simultaneously fetching the ith instruction, decoding the (i - 1)th instruction and
at the same time executing the (i - 2)th instruction.
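
The overlap described for the three-stage pipeline can be tabulated with a short Python sketch. It assumes an ideal pipeline with one cycle per stage and no stalls; the function name `pipeline_occupancy` is hypothetical and instructions are numbered from 1.

```python
STAGES = ("fetch", "decode", "execute")


def pipeline_occupancy(n_instructions, stages=STAGES):
    """Return one dict per cycle mapping stage name to the instruction
    number (1-based) in that stage, or None when the stage is idle."""
    n_cycles = n_instructions + len(stages) - 1
    table = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(stages):
            instr = cycle - depth   # 0-based index of the instruction here
            row[stage] = instr + 1 if 0 <= instr < n_instructions else None
        table.append(row)
    return table


for cycle, row in enumerate(pipeline_occupancy(4), start=1):
    print(cycle, row)
```

In the third cycle the table shows instruction 3 being fetched, instruction 2 decoded and instruction 1 executed, which is the i, (i - 1), (i - 2) pattern described in the text.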

The three-stage pipelining discussed above is based on the technique used in the
Texas Instruments TMS320 processors. As in other applications of pipelining, in the
TMS320 a number of registers are used to achieve the pipeline: a prefetch counter
holds the address of the next instruction to be fetched, an instruction register holds the
instruction to be executed, and a queue instruction register stores the instructions to be
executed if the current instruction is still executing. The program counter contains the
address of the next instruction to execute.
By exploiting the inherent parallelism in the instruction stream, pipelining leads
to a significant reduction, on average, of the execution time per instruction. The
throughput of a pipeline machine is determined by the number of instructions through
the pipe per unit time. As in a production line, all the stages in the pipeline must be
synchronized. The time for moving an instruction from one step to another within the
pipe (see Figure 12.6(a)) is one cycle and depends on the slowest stage in the pipeline.
In a perfect pipeline, the average time per instruction is given by (Hennessy and
Patterson, 1990)

$$\text{average time per instruction} = \frac{\text{time per instruction (nonpipeline)}}{\text{number of pipe stages}} \tag{12.1}$$

In the ideal case, the speed increase is equal to the number of pipe stages. In practice,
the speed increase will be less because of the overheads in setting up the pipeline,
delays in the pipeline registers, and so on.
In the pipeline machine, each instruction still takes three clock cycles, but at each
cycle the processor is executing up to three different instructions. Pipelining increases
the system throughput, but does not reduce the execution time of each instruction on
its own. Typically, there is a slight increase in the execution time of each instruction
because of the pipeline overhead.
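
As a rough numerical check of Equation 12.1, the Python sketch below compares a non-pipelined machine with an ideal pipeline. It ignores the pipeline overheads mentioned above; the function name `pipeline_timing` and the 25 ns cycle time are illustrative assumptions.

```python
def pipeline_timing(n_instructions, n_stages, cycle_time_ns):
    """Ideal timing model: every stage takes one cycle, no stalls.

    Non-pipelined: each instruction runs all stages before the next starts.
    Pipelined: the first instruction needs n_stages cycles to fill the
    pipe, then one instruction completes per cycle.
    """
    nonpipelined = n_instructions * n_stages * cycle_time_ns
    pipelined = (n_stages + n_instructions - 1) * cycle_time_ns
    return {
        "nonpipelined_ns": nonpipelined,
        "pipelined_ns": pipelined,
        "speedup": nonpipelined / pipelined,
        # Time for one instruction on its own: unchanged by pipelining.
        "latency_ns": n_stages * cycle_time_ns,
    }


for n in (3, 100, 10000):
    print(n, pipeline_timing(n, n_stages=3, cycle_time_ns=25))
```

For three instructions the speed-up is only 1.8 because filling the pipe is a large share of the run; for long instruction streams it approaches the number of stages, 3, while the latency of a single instruction stays at three cycles.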
Pipelining has a major impact on the system memory. The number of memory
accesses in a pipeline machine increases, essentially by the number of stages. In DSP
the use of Harvard architecture, where data and instructions lie in separate memory
spaces, promotes pipelining.
When a slow unit, such as a data memory, and an arithmetic element are connected
in series, the arithmetic unit often waits idly for a good deal of the time for data.
Pipelining may be used in such cases to allow a better utilization of the arithmetic
unit. The next example illustrates the concept.
DSP algorithms are often repetitive but highly structured, making them well suited
to multilevel pipelining. For example, FFT requires the continuous calculation of
butterflies. Although each butterfly requires different data and coefficients, the basic
butterfly arithmetic operations are identical. Thus arithmetic units such as FFT
processors can be tailored to take advantage of this. Pipelining ensures a steady flow
of instructions to the CPU, and in general leads to a significant increase in system
throughput. However, on occasions pipelining may cause problems. For example, in
some digital signal processors, pipelining may cause an unwanted instruction to be
executed, especially near branch instructions, and the designer should be aware of this
possibility.
12.2.3 Hardware multiplier-accumulator

The basic numerical operations in DSP are multiplications and additions.
Multiplication, in software, is notoriously time consuming. Additions are even more time
consuming if floating point arithmetic is used. To make real-time DSP possible a fast,
dedicated hardware multiplier-accumulator (MAC) using fixed or floating point
arithmetic is mandatory. Fixed or floating hardware MAC is now standard in all
digital signal processors. In a fixed point processor, the hardware multiplier typically
accepts two 16-bit 2's complement fractional numbers and computes a 32-bit product
in a single cycle (25 ns typically). The average MAC instruction time can be
significantly reduced through the use of special repeat instructions.

Figure 12.10 A typical MAC configuration in DSPs. (Blocks shown: 16-bit X and Y
registers fed by the X data and Y data, a 32-bit P register and a 32-bit R register.)

A typical DSP hardware MAC configuration is depicted in Figure 12.10. In this
configuration, the multiplier has a pair of input registers that hold the inputs to the
multiplier, and a 32-bit product register which holds the result of a multiplication. The
output of the P (product) register is connected to a double-precision accumulator,
where the products are accumulated.
The principle is very much the same for hardware floating-point multiplier-
accumulators, except that the inputs and products are normalized floating-point
numbers. Floating-point MACs allow fast computation of DSP results with minimal
errors. As discussed in Chapters 7 and 8, DSP algorithms such as FIR and IIR filtering
suffer from the effects of finite wordlength (coefficient quantization and arithmetic
errors). Floating point offers a wide dynamic range and reduced arithmetic errors,
although for many applications the dynamic range provided by the fixed-point
representation is adequate.
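
The fixed-point MAC described above can be mimicked in a few lines of Python. This is only a sketch of the arithmetic, assuming the common Q15 convention for 16-bit fractional numbers (value = integer / 2^15); the helper names `to_q15`, `mac_q15` and `q30_to_float` are hypothetical, and a Python integer stands in for the double-precision accumulator.

```python
def to_q15(x):
    """Quantize a real number in [-1, 1) to a 16-bit two's complement
    Q15 integer, saturating at the ends of the range."""
    q = int(round(x * 32768))
    return max(-32768, min(32767, q))


def mac_q15(xs, ys):
    """Multiply-accumulate two Q15 sequences.

    Each 16 x 16-bit product fits in 32 bits and is in Q30 format.
    The running sum is kept in an unbounded Python int, which stands
    in for the wider accumulator of the hardware.
    """
    acc = 0
    for x, y in zip(xs, ys):
        product = x * y      # what the P (product) register would hold
        acc += product       # accumulate
    return acc


def q30_to_float(acc):
    """Convert a Q30 accumulator value back to a real number."""
    return acc / (1 << 30)


# Illustrative data and coefficients (assumed values).
x = [to_q15(v) for v in (0.5, -0.25, 0.75)]
h = [to_q15(v) for v in (0.5, 0.5, 0.25)]
acc = mac_q15(x, h)
print(acc, q30_to_float(acc))   # 335544320 0.3125
```

The result matches the exact sum 0.25 - 0.125 + 0.1875 = 0.3125 because these particular values are representable in Q15 without rounding; in general the quantization in `to_q15` is one source of the finite wordlength errors mentioned above.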
TMS320C67XX
Features of TMS320C67xx
◼ Advanced VLIW CPU with eight functional units
◼ Two multipliers & Six arithmetic units
◼ Executes up to eight 32-bit instructions per cycle
◼ Develop highly effective RISC-like code
◼ CPU consists of 32 general-purpose registers (32-bit)
◼ Variable-width instructions: flexibility of data types
◼ 8/16/32-bit data support, providing efficient memory support
◼ Efficient code execution on independent functional units
◼ Industry’s most efficient C compiler on DSP benchmark suite
◼ Industry’s first assembly optimizer for fast development and improved
parallelization
Hardware support for:
◼ Single-precision (32-bit) operations.
◼ Double-precision (64-bit) operations.
◼ 32 x 32-bit integer multiply with 32- or 64-bit result
The figure above is the block diagram for the C67xx DSP. The C6000 devices come with
program memory, which, on some devices, can be used as a program cache. The devices
also have varying sizes of data memory. Peripherals such as a direct memory access (DMA)
controller, power-down logic, and external memory interface (EMIF) usually come with the
CPU, while peripherals such as serial ports and host ports are on only certain devices.
Central processing unit (CPU)

The CPU contains:
◼ Program fetch unit.
◼ Instruction dispatch unit.
◼ Instruction decode unit.
◼ Two data paths, each with four functional units.
◼ 32 32-bit registers.
◼ Control logic.
◼ Test, emulation and interrupt logic.
The CPU takes all its instructions from the program memory and all the data it
requires from the data memory.

The program fetch, instruction dispatch and instruction decode units can deliver up to
eight 32-bit instructions to the functional units every CPU clock cycle.
General-Purpose Register Files

◼ There are two general-purpose register files (A and B) in the data paths. Each of
these files contains 16 32-bit registers (A0–A15 for file A and B0–B15 for file B).
◼ The core processing units are called data paths. The CPU core has two of them,
data path A and data path B. Instructions are processed in the two data paths, each
of which contains four functional units and 16 registers, giving a total of 32
general-purpose registers, each 32 bits wide.
◼ The eight functional units are divided into two groups of four: two multipliers
and six ALUs. L1, S1, D1, L2, S2 and D2 are ALUs; M1 and M2 are multipliers.
◼ A control register file provides the means to configure and control various processor
operations.
◼ The general-purpose registers can be used for data, data address pointers, or
condition registers.
◼ The C67xx general-purpose register files support data ranging in size from packed
16-bit data through 40-bit fixed-point and 64-bit floating-point data.
◼ Values larger than 32 bits, such as 40-bit long and 64-bit float quantities, are
stored in register pairs. The 32 LSBs of data are placed in an even-numbered
register and the remaining 8 or 32 MSBs in the next upper register (which
is always an odd-numbered register).
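
The register-pair layout can be illustrated with a small Python sketch. It models only the bit splitting described above, not the processor itself; the helper names `split_into_pair` and `join_pair` are hypothetical.

```python
import struct

MASK32 = 0xFFFFFFFF


def split_into_pair(value, width):
    """Split a 40- or 64-bit value into (even_reg, odd_reg).

    even_reg receives the 32 LSBs; odd_reg receives the remaining
    8 or 32 MSBs, as described for register pairs.
    """
    if width not in (40, 64):
        raise ValueError("width must be 40 or 64")
    value &= (1 << width) - 1       # keep only the low `width` bits
    even_reg = value & MASK32
    odd_reg = value >> 32
    return even_reg, odd_reg


def join_pair(even_reg, odd_reg):
    """Rebuild the original value from a register pair."""
    return (odd_reg << 32) | even_reg


# A 40-bit quantity: the upper 8 bits go to the odd register.
even, odd = split_into_pair(0x123456789A, 40)
print(hex(even), hex(odd))          # 0x3456789a 0x12

# A 64-bit float: reinterpret the IEEE 754 bits of 1.0 as an integer.
bits = struct.unpack("<Q", struct.pack("<d", 1.0))[0]
print([hex(r) for r in split_into_pair(bits, 64)])   # ['0x0', '0x3ff00000']
```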
Internal Memory
The C67x DSP has a 32-bit, byte-addressable address space. Internal memory is
organized in separate data and program spaces. When off-chip memory is used, these
spaces are unified on most devices to a single memory space via the external
memory interface (EMIF).
Memory and peripheral options
A variety of memory and peripheral options are available for the C6000 platform.
◼ Large on-chip RAM, up to 7M bits.
◼ Program cache, which holds frequently accessed program code; it uses a 32-bit
address and 256-bit data.
◼ Data memory, which can also be used as data cache. Data sizes of 8, 16
or 32 bits are supported. Two-level cache.
◼ 32-bit external memory interface, which supports SDRAM, SBSRAM, SRAM, and
other asynchronous memories for a broad range of external memory
requirements and maximum system performance.

◼ The C67xx DSP uses a VLIW (Very Long Instruction Word) architecture, which
packs a set of instructions into one long word so that they can be processed in
parallel.
DMA
◼ The DMA (Direct Memory Access) controller transfers data between address
ranges in the memory map without intervention by the CPU. It has four
programmable channels and one auxiliary channel, which is not programmable.
◼ The Extended DMA (EDMA) controller performs the same functions as the DMA
controller and offers 16 programmable channels.
HPI
◼ The Host Port Interface (HPI) is a parallel port through which a host processor can
directly access the CPU’s memory space. It is a dedicated port for
communication between the two processors.
Expansion bus
◼ The expansion bus is a replacement for the HPI, as well as an expansion of the
EMIF (External Memory Interface). When the EMIF is used, the program memory
and data memory are used together as a single memory space.
◼ There are two modes of operation: asynchronous and synchronous. In
asynchronous mode the bus acts as a slave; in synchronous mode both slave and
master operation are possible.
Multichannel Buffered Serial Port (McBSP)
◼ The McBSP is based on the standard serial port interface found on the DSP
processor. It allows full-duplex communication and consists of
a data path and a control path that connect to external devices.
Timers
The C6000 devices have two 32-bit general-purpose timers, which are used to:
◼ Time events.
◼ Count events.
◼ Generate pulses.
◼ Interrupt the CPU.
◼ Send synchronization events to the DMA/EDMA controllers.
Power-down logic
◼ Power-down logic allows reduced power consumption. It is active whenever the
DSP is operating in one of its power-saving modes.

