

A Digital VLSI Architecture for
Real-World Applications
Dan Hammerstrom

INTRODUCTION

As the other chapters of this book show, the neural network model has significant advantages over traditional models for certain applications. It has also expanded our understanding of biological neural networks by providing a theoretical foundation and a set of functional models.

Neural network simulation remains a computationally intensive activity, however. The underlying computations, generally multiply-accumulates, are simple but numerous. For example, in a simple artificial neural network (ANN) model, most nodes are connected to most other nodes, leading to O(N²) connections:¹ a network with 100,000 nodes, modest by biological standards, would therefore have about 10 billion connections, with a multiply-accumulate operation needed for each connection. If a state-of-the-art workstation can simulate roughly 10 million connections per second, then one pass through the network takes 1000 sec (about 20 min). This data rate is much too slow for real-time process control or speech recognition, which must update several times a second. Clearly, we have a problem.

¹The "order of" O(F(n)) notation means that the quantity represented by O is approximate for the function F within a multiplication or division by n.
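To make the cost concrete, the dominant computation reduces to a dense multiply-accumulate loop. The following is a minimal sketch in plain C (illustrative names, not CNAPS code) of evaluating one fully connected layer; the nested loops execute exactly one multiply-accumulate per connection, which is the operation count discussed above.

    /* One fully connected layer: n_out * n_in multiply-accumulates.
     * At 10^10 connections and ~10^7 of these per second, one pass
     * takes ~1000 sec, as in the text. */
    void layer_forward(int n_in, int n_out,
                       const float w[n_out][n_in],   /* weight matrix */
                       const float in[n_in], float out[n_out])
    {
        for (int j = 0; j < n_out; j++) {
            float sum = 0.0f;
            for (int i = 0; i < n_in; i++)
                sum += w[j][i] * in[i];      /* one connection per MAC */
            out[j] = sum;                    /* activation omitted */
        }
    }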
This performance bottleneck is worse if each connection requires more complex computations, for instance, for incremental learning algorithms or for more realistic biological simulations. Eliminating this computational barrier has led to much research into building custom Very Large Scale Integration (VLSI) silicon chips optimized for ANNs. Such chips might perform ANN simulations hundreds to thousands of times faster than workstations or personal computers, for about the same cost.

The research into VLSI chips for neural network and pattern recognition applications is based on the premise that optimizing the chip architecture to the computational characteristics of the problem lets the designer create a silicon device offering a big improvement in performance/cost or "operations per dollar." In silicon design, the cost of a chip is primarily determined by its two-dimensional area.
Smaller chips are cheaper chips. Within a chip, the cost of an operation is roughly determined by the silicon area needed to implement it. Furthermore, speed and cost usually have an inverse relationship: faster chips are generally bigger chips.

The silicon designer's goal is to increase the number of operations per unit area of silicon, called functional density, in turn increasing operations per dollar. An advantage of ANN, pattern recognition, and image processing algorithms is that they employ simple, low-precision operations requiring little silicon area. As a result, chips designed for ANN emulation can have a higher functional density than traditional chips such as microprocessors. The motive for developing specialized chips, whether analog or digital, is this potential to improve performance, reduce cost, or both.

The designer of specialized silicon faces many other choices and trade-offs. One of the most important is flexibility versus speed. At the "specialized" end of the flexibility spectrum, the designer gives up versatility for speed to make a fast chip dedicated to one task. At the "general purpose" end, the sacrifice is reversed, yielding a slower, but programmable, device. The choice is difficult because both traits are desirable. Real-world neural network applications ultimately need chips across the entire spectrum.

This chapter reviews one such architecture, CNAPS² (Connected Network of Adaptive Processors), developed by Adaptive Solutions, Inc. This architecture was designed for ANN simulation, image processing, and pattern recognition. To be useful in these related contexts, it occupies a point near the "general purpose" end of the flexibility spectrum. We believe that, for its intended markets, the CNAPS architecture has the right combination of speed and flexibility. One reason for writing this chapter is to provide a retrospective on the CNAPS architecture after several years' experience developing software and applications for it.

²Trademark Adaptive Solutions, Inc.

The chapter has three major sections, each framed in terms of the capabilities needed in the CNAPS computer's target markets. The first section presents an overview of the CNAPS architecture and offers a rationale for its major design decisions. It also summarizes the architecture's limitations and describes aspects that, in hindsight, its designers might have done differently. The section ends with a brief discussion of the software developed for the machine so far.

The second section briefly reviews applications developed for CNAPS at this writing.³ The applications discussed are simple image processing, automatic target recognition, a simulation of the Lynch/Granger Pyriform Model, and Kanji OCR. Finally, to offer a broader perspective of real-world ANN usage, the third section reviews non-CNAPS applications, specifically, examples of process control and financial analysis.

³Because ANNs are becoming a key technology, many customers consider their use of ANNs to be proprietary information. Many applications are not yet public knowledge.

THE CNAPS ARCHITECTURE

The CNAPS architecture consists of an array of processors controlled by a sequencer, both implemented as a chip set developed by Adaptive Solutions, Inc. The sequencer is a one-chip device called the CNAPS Sequencer Chip (CSC). The processor array is also a one-chip device, available with either 64 or 16 processors per chip (the CNAPS-1064 or CNAPS-1016). The CSC can control up to eight 1064s or 1016s, which act like one large device.

These chips usually sit on a printed circuit board that plugs into a host computer, also called the control processor (CP). The CNAPS board acts as a coprocessor within the host. Under the coprocessor model, the host sends data and programs to the board, which runs until done, then interrupts the host to indicate completion. This style of operation is called "run to completion semantics." Another possible model is to use the CNAPS board as a stand-alone device to process data continuously.
The CNAPS Architecture

Basic Structure

CNAPS is a single instruction, multiple data stream (SIMD) architecture. A SIMD computer has one instruction sequencing/control unit and many processor nodes (PNs). In CNAPS, the PNs are connected in a one-dimensional array (Figure 1) in which each PN can "talk" only to its right or left neighbors. The sequencer broadcasts each instruction plus input data to all PNs, which execute the same instruction at each clock. The PNs transmit output data to the sequencer, with several arbitration modes controlling access to the output bus.

As Figure 2 suggests, each PN has a local memory,⁴ a multiplier, an adder/subtracter, a shifter/logic unit, a register file,⁵ and a memory addressing unit. The entire PN uses fixed-point, two's complement arithmetic, and the precision is 16 bits, with some exceptions. The PN memory can handle 8- or 16-bit reads or writes. The multiplier produces a 24-bit output; an 8 × 16 or 8 × 8 multiply takes one clock, and a 16 × 16 multiply takes two clocks. The adder can switch between 16- or 32-bit modes. The input and output buses are 8 bits wide, and a 16-bit word can be assembled (or disassembled) from two bytes in two clocks.

⁴Currently 4 KB per PN.
⁵Currently 32 16-bit registers.

A PN has several additional features (Hammerstrom, 1990, 1991), including a function that finds the PN with the largest or smallest values (useful for winner-take-all and best-match operations), various precision and memory control features, and OutBus arbitration. These features are too detailed to discuss fully here.

The CSC sequencer (Figure 3) performs program sequencing for the PN array and has private access to a program memory. The CSC also performs input/output (I/O) processing for the array, writing input data to the array and reading output data from it. To move data to and from CP memory, the CSC has a 32-bit bus, called the AdaptBus, on the CP side. The CSC also has a direct input port and a direct output port used to connect the CSC directly to I/O devices for higher-bandwidth data movement.

Neural Network Example

The CNAPS architecture can run many ANN and non-ANN algorithms. Many SIMD techniques are the same in both contexts, so an ANN can serve as a general example of mapping an algorithm to the array. Specifically, the example here shows how the PN array simulates a layer in an ANN.

Start by assuming a two-layered network (Figure 4) in which, for simplicity, each node in each layer maps to one PN. PNi thus simulates the node n_ij, where i is the node index in the layer and j is the layer index. Layers are simulated in a time-multiplexed manner. All layer 1 nodes thus execute as a block, then all layer 2 nodes, and so on. Finally, assume that layer 1 has already calculated its various n_i1 outputs.

The goal at this point is to calculate the outputs for layer 2. To achieve this, all layer 1 PNs simultaneously load their output values into a special output buffer and begin arbitration for the output bus. In this case, the arbitration mode lets each PN transmit its output in sequence. In one clock, the content of PN0's buffer is placed on the output bus and goes through the sequencer⁶ to the input bus. From the input bus, the value is broadcast to all PNs (this out-to-in loopback feature is a key to implementing layered structures efficiently). Each PN then multiplies node n_01's output with a locally stored weight, w_ij.

⁶This operation actually takes several clocks and must be pipelined. These details are eliminated here for clarity.

On the next clock, node n_11's output is broadcast to all PNs, and so on for the remaining layer 1 output values. After N clocks, all outputs have been broadcast, and the inner product computation is complete. All PNs then use the accumulated value's most significant 8 bits to look up an 8-bit nonlinear output value in a 256-item table stored in each PN's local memory.
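The loop structure below sketches this broadcast-and-accumulate sequence in plain C (array sizes and names are illustrative, and this is not the CNAPS instruction set). The outer loop models one bus slot per clock; the inner loop models all PNs accumulating in lockstep.

    /* One bus slot per clock: node t's output goes through the sequencer's
     * out-to-in loopback and every PN multiplies it by a local weight.
     * After n_prev slots (n_prev <= 256 assumed), each PN holds one
     * finished inner product, ready for the 256-entry table lookup. */
    #define NPN 64                            /* PNs per CNAPS-1064 */

    void broadcast_inner_product(int n_prev,
                                 const unsigned char prev_out[],
                                 const short w[NPN][256],
                                 long acc[NPN])
    {
        for (int p = 0; p < NPN; p++) acc[p] = 0;
        for (int t = 0; t < n_prev; t++)      /* one broadcast per clock */
            for (int p = 0; p < NPN; p++)     /* all PNs in lockstep     */
                acc[p] += (long)w[p][t] * prev_out[t];
    }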
FIGURE 1 The basic CNAPS architecture. CNAPS is a single instruction, multiple data (SIMD) architecture that uses broadcast input, one-dimensional interprocessor communication, and a single shared output bus. (The diagram shows the OUT, IN, and PNCMD buses linking the sequencer to PN0 through PN63 in a chain.)

FIGURE 2 The internal structure of a CNAPS processor node (PN). Each PN has its own storage and arithmetic capabilities. Storage consists of 4096 bytes. Arithmetic operations include multiply, accumulate, logic, and shift. All units are interconnected by two 16-bit buses (A and B); the inter-PN bus is 4 bits wide, 2 in and 2 out.
FIGURE 3 The CNAPS sequencer chip (CSC) internal structure. The CSC accesses an external program store, which contains both CSC and CNAPS PN array instructions. PN array instructions are broadcast to all PNs. CSC instructions control sequencing and all array input and output.

This process, calculating a weighted sum and then passing it through a function stored in a table, is performed for each output on each layer. The last layer transmits its output values through the CSC to an output buffer in the CP memory.

The multiply-accumulate pipeline can compute a connection in each clock. The example network has four nodes and uses only four clocks for its 16 connections. For even greater efficiency, other operations can be performed in the same clock as the multiply-accumulate. The separate memory address unit, for instance, can compute the next weight's address at the same time as the connection computation; and the local memory allows the weight to be fetched without delay.

An array of 256 PNs can compute 256² = 65,536 connections in 256 clocks. At a 25-MHz clock frequency, this equals 6.4 billion connections per second (back-propagation feed-forward) and over 1 billion connection updates per second (back-propagation learning). An array of 64 PNs (one CNAPS-1064 chip), for example, can store and train the entire NetTalk (Sejnowski & Rosenberg, 1986) network in about 7 sec.
Physical Implementation

The CNAPS PN array has been implemented in two chips, one with 64 PNs (the CNAPS-1064; Griffin et al., 1990; Figure 5) and the other with 16 PNs (the CNAPS-1016). Both chips are implemented in a 0.8 micron CMOS process. The 64-PN chip is a full custom design, is approximately 26 mm on a side, and has more than 14 million transistors, making it one of the largest processor chips ever made. The simple computational model makes possible a small, simple PN, in turn permitting the use of redundancy to improve semiconductor yield for such a device.

The CSC is implemented using a gate array technology, using a 100,000-gate die, and is about 10 mm on a side.

The next section reviews the various design decisions and the reasons for making them. Some of the features described are unique to CNAPS; others apply to any digital signal processor chip.

FIGURE 4 A simple two-layered neural network. In this example, each PN emulates two network nodes. PNs emulate the first layer, computing one connection each clock. Then, they sequentially place node output on the OutBus while emulating, in parallel, the second layer. (In the diagram, broadcast by PN0 of CN0's output to CN4, CN5, CN6, and CN7 takes one clock; N² connections take N clocks.)

Major Design Decisions

When designing the CNAPS architecture, a key question was where it should sit relative to other computing devices in cost and capabilities. In computer design, flexibility and performance are almost always inversely related. We wanted CNAPS to be flexible enough to execute a broad family of ANN algorithms as well as other related pattern recognition and preprocessing algorithms. Yet, we wanted it to have much higher performance than state-of-the-art workstations and, at the same time, lower cost for its functions.

Figure 6 shows where we are targeting CNAPS. The vertical dimension plots each architecture by its flexibility. Flexibility is difficult to quantify, because it involves not only the range of algorithms that an architecture can execute, but also the complexity of the problems it can solve. (Greater complexity typically requires a larger range of operations.) As a result, this graph is subjective and provided only as an illustration.

The horizontal dimension plots each architecture by its performance/cost, or operations per second per dollar. The values are expressed in a log scale due to the orders-of-magnitude difference between traditional microprocessors at the low end and highly custom, analog chips at the high end. Note the technology barrier, defined by practical limits of current semiconductor manufacturing. No one can build past the barrier: you can do only so much with a transistor; you can put only so many of them on a chip; and you can run them only so fast.

For pattern recognition, we placed the CNAPS architecture in the middle, between the specialized analog chips and the general-purpose microprocessors. We wanted it to be programmable enough to solve many real-world problems, and yet have a performance/cost about 100 times better than the highest performance RISC processors. The CNAPS applications discussed later show that we have provided sufficient flexibility to solve complex problems.

In determining the degree of function required, we must solve all or most of a targeted problem. This need results from Amdahl's law, which states that system
FIGURE 5 The CNAPS PN array chip. There are 64 PNs with memory on each die. The PN array chip is one of the largest processor chips ever made. It consists of 14 million transistors and is over 26 mm on a side. PN redundancy (there are 16 spare PNs) is used to guarantee high yields.

performance depends mainly on the slowest component. This law can be formalized as follows:

    S = 1 / (op_f / s_f + op_h)                                  (1)

where S is the total system speed-up, op_f is the fraction of total operations in the part of the computation run on the fast chip, s_f is the speedup the chip provides, and op_h is the fraction of total operations run on the host computer without acceleration. Hence, as op_f or s_f get large, S approaches 1/op_h. Unfortunately, op_f needs to be close to one before any real system-level improvement occurs, as shown in the following example.

Suppose there are two such support chips to choose from: the first can run 80% of the computation with a 20× improvement on that 80%; the second can run only 20%, but runs that 20% 1000× faster. By Amdahl's law, the first chip speeds up the system by more than a factor of four, whereas the second (and seemingly faster) chip speeds up the system by only about 25%. So Amdahl tells us that flexibility is often better than raw performance, especially if that performance results from limiting the range of operations performed by the device.

FIGURE 6 Though subjective, this graph gives a rough indication of the CNAPS market positioning. The vertical dimension measures the range of functionality of an architecture; the horizontal dimension measures the performance/cost in operations per second per dollar (on a log scale, bounded by a "technology barrier": RISC processors and DSPs sit at the flexible, low performance/cost end, CNAPS in the middle, and full-custom digital/analog chips at the high end). The philosophy behind CNAPS is that by restricting functionality to pattern recognition, image processing, and neural network emulation, a larger performance/cost is possible than with traditional machines (parallel or sequential).
Digital

Much effort has been dedicated to building analog VLSI chips for ANNs. Analog chips have great appeal, partly because they follow biological models more closely than digital chips. Analog chips also can achieve higher functional density. Excellent papers reporting research in this area include Mead (1989), Akers, Haghighi, and Rao (1990), Graf, Jackel, and Hubbard (1988), Holler, Tam, Castro, and Benson (1989), and Alspector (1991). Also, see Morgan (1990) for a good summary of digital neural network emulation.

Analog ANN implementations have been primarily academic or industrial research projects, however. Only a few have found their way into the real world as commercial products: getting an analog device to work in a laboratory is one thing; making it work over a wide range of voltages, temperatures, and user capabilities is another. In general, analog chips require much more stringent operating conditions than digital chips. They are also more difficult to design and, after implementation, less flexible.

The semiconductor industry is heavily oriented toward digital chips. Analog chips represent only a minor part of the total output, reinforcing their secondary position. There are, of course, successful analog parts, and there always will be, because some applications require analog's higher functional density to achieve their cost and performance constraints, and those applications can tolerate analog's limited flexibility. Likewise, there will be successful products using analog ANN chips. Analog parts will probably be used in simple applications, or as a part of a larger system in more complex applications.

This prediction follows primarily from the limited flexibility of analog chips. They typically implement one algorithm, hardwired into the chip. A hardwired algorithm is fine if it is truly stable and it is all you need. The field of ANN applications is still new, however, so most complex implementations are still actively evolving, even at the algorithm level. An analog device cannot easily follow such changes. A digital, programmable device can change algorithms by changing software.

Our major goal was to produce a commercial product that would be flexible enough and provide sufficient precision to cover a broad range of complex problems. This goal dictated a digital design, because digital could offer accurate precision and much more flexibility than a typical CMOS analog implementation. Digital also offered excellent performance and the advantages of a standardized technology.

Limited, Fixed-Point Precision

In both analog and digital domains, an important decision is choosing the arithmetic precision required. In analog, precision affects design complexity and the amount of compensation circuitry required. In digital, it affects the number of wires available as well as the size and complexity of memory, buses, and arithmetic units. Precision also affects the power dissipation.

In the digital domain, a related decision involves floating-point versus fixed-point representation. Floating-point numbers (Figure 7) consist of an exponent (usually 8 bits representing base 2 or base 16) and a mantissa (usually 23 bits). The exponent is set so that the mantissa is always normalized; that is, the most significant "1" of the data is in the most significant position of the field. Adding two floating-point numbers involves shifting at least one of the operands to get the same exponent.

FIGURE 7 A floating point number. A single-precision, IEEE compatible floating point configuration is shown (a 32-bit word). The high order 8 bits constitute the exponent; the remaining 24 bits, the mantissa or "fractional" part. Floating point numbers are usually normalized so that the mantissa has a 1 in the most significant position.
Multiplying two floating-point numbers involves separate arithmetic on both exponents and mantissas. Both operations require postnormalizing shifts after the arithmetic operations.

Floating point has several advantages. The primary advantage is dynamic range, which results from the separate exponent. Another is precision, due to the 24-bit mantissas. The disadvantage to floating point is its cost in silicon area. Much circuitry is required to keep track of both exponents and mantissas and to perform pre- and postoperation shifting of the mantissa. This circuitry is particularly complicated if high speed is required.

Fixed-point numbers consist of a numeral (usually 16 to 32 bits) and a radix point (in base 2, the binary point). In fixed point, the programmer chooses the position of the radix point. This position is typically fixed for the calculation, although it is possible to change the radix point under software control by explicitly shifting operands. For many applications needing only limited dynamic range and precision, fixed point is sufficient. It is also much cheaper than floating point because it requires less silicon area.

After choosing a digital signal representation for CNAPS, the next question was how to represent the numbers. Biological neurons are known to use relatively low precision and to have a limited dynamic range. These characteristics strongly suggest that a digital computer for emulating ANN structures should be able to employ limited precision fixed-point arithmetic. This conjecture in turn suggests an opportunity to simplify significantly the arithmetic units and to provide greater computational density. Fixed-point arithmetic also places the design near the desired point on the flexibility versus performance/cost curve (Figure 6).

To confirm the supposition that fixed point is adequate, we performed extensive simulations. We found that for the target applications, 8- or 16-bit fixed-point precision was sufficient (Baker & Hammerstrom, 1989). Other researchers have since reached the same conclusion (Hoehfeld and Fahlman, 1992; Shoemaker, Carlin, & Shimabukuro, 1990). In keeping with experimental results, we used a general 16-bit resolution inside the PN. One exception was to use a 32-bit adder to provide additional head-room for repeated multiply-accumulates. Another was to use 8-bit input and output data buses, because most computations involve 8-bit data and 8- or 16-bit weights, and because busing external to the PN is expensive in silicon area.
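As an illustration of what such limited-precision arithmetic looks like, the sketch below implements a "scaled 8 8" (Q8.8) multiply in portable C: 8 integer bits, 8 fraction bits, and a 32-bit intermediate product. The helper names are ours, not the CNAPS library's.

    typedef short fx8_8;                     /* Q8.8: value = raw / 256 */
    #define FX_ONE ((fx8_8)(1 << 8))         /* 1.0 in Q8.8             */

    /* Multiply two Q8.8 numbers: 16 x 16 -> 32-bit product, then shift
     * the radix point back. A real PN would also saturate or round. */
    static fx8_8 fx_mul(fx8_8 a, fx8_8 b)
    {
        long p = (long)a * (long)b;          /* Q16.16 intermediate */
        return (fx8_8)(p >> 8);              /* back to Q8.8        */
    }
    /* Example: fx_mul(2.5*256, 1.5*256) yields 3.75*256 == 960. */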
SIMD

The next major decision was how to control the PNs. A computer can have one or more instruction streams and one or more data streams. Most computers are single instruction, single data (SISD) computers. These have one control unit and one processor unit, usually combined on one chip (a microprocessor). The control unit fetches instructions from program memory and decodes them. It then sends data operations such as add, subtract, or multiply to the processing unit. Sequencing operations, such as branch, are executed by the control unit itself. The SISD computers are serial, not parallel.

Two major families of parallel computer architectures have evolved: multiple instruction, multiple data (MIMD) and single instruction, multiple data (SIMD). MIMD computers have many processing units, each of which has its own control unit. Each control/processing unit can operate in parallel, executing many instructions at once. Because the processors operate independently, MIMD is the most powerful and flexible parallel architecture. The independent, asynchronous processors also make MIMD the most difficult to use, requiring complex processor synchronization.

The SIMD computers have many processors but only one instruction stream. All processors receive the same instruction at the same time, but each acts on its own slice of the data. SIMD computers thus have an array of processors and can perform an operation on a block of data in one step. SIMD computing is often called "data parallel" computing, because it applies one control thread to multiple local data elements, executing one instruction at each clock.

SIMD computation is perfect for vector and matrix arithmetic. Because of Amdahl's law, however, SIMD is cost-effective only if most operations are matrix or vector operations. For general-purpose computing, this is not the case.
Consequently, SIMD machines are poor general-purpose computers and rarer than SISD or even MIMD computers. Our target domain is not general-purpose computing, however. For ANNs and other image and signal processing algorithms, the dominant calculations are vector or matrix operations. SIMD fits this domain perfectly.

The SIMD architecture is a good choice for practical reasons, too. One advantage is cost: SIMD is much cheaper than MIMD, because there is only one control unit for the entire array of processors. Another is that SIMD is easier to program than MIMD, because all processors do the same thing at the same time. Likewise, it is easier to develop computer languages for SIMD, because it is relatively easy to develop parallel data structures where the data are operated on simultaneously. Figure 8 shows a simple CNAPS-C program that multiplies a vector times a matrix. Normally, vector-matrix multiply takes n² operations. By placing each column of the matrix on each PN, it takes n operations on n processors.

    #define N 20
    #define K 30
    typedef scaled 8 8 arithType;
    domain Krows
        {arithType sourceMatrix[N];
         arithType resultVector;} dimK[K];

    main()
    {
        int n;
        [domain dimK].{
            resultVector = 0;
            for (n = 0; n < N; n++)
                resultVector += sourceMatrix[n] * getchar();
        }
    }

FIGURE 8 A CNAPS-C program to do a simple vector-matrix multiply. The "data-parallel" programming is evident here. Within the loop, it is assumed because of the domain declaration that there are multiple copies of each matrix element, one on each PN. The program takes N loop iterations, which would require N² on a sequential machine.
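For comparison, a conventional sequential version of the same kernel (our sketch, in standard C rather than CNAPS-C) makes the N × K operation count explicit; the data-parallel program above effectively replaces the outer loop with the PN array:

    /* Sequential vector-matrix multiply: k results, n MACs each. */
    void vecmat(int n, int k, const short m[k][n],
                const short v[n], long result[k])
    {
        for (int r = 0; r < k; r++) {        /* one row per "PN"       */
            result[r] = 0;
            for (int c = 0; c < n; c++)      /* n iterations, matching */
                result[r] += (long)m[r][c] * v[c];  /* the CNAPS loop  */
        }
    }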
In sum, SIMD was better than MIMD for CNAPS because it fit the problem domain, was much more economical, and was easier to program.

Broadcast Interconnect

The next decision concerned how to interconnect the PNs for data transfer, both within the array and outside it. Computer architects have developed several interconnect structures for connecting processors in multiprocessor systems. Because CNAPS is a SIMD machine, we were interested only in synchronous structures.

The two families of interconnect structures are local and global. Local interconnect attaches only neighboring PNs. The most common local scheme is NEWS (North-East-West-South, Figure 9). In NEWS, the PNs are laid out in a two-dimensional array, and each PN is connected to its four nearest neighbors. A one-dimensional variation connects each PN only to its left and right neighbors.

FIGURE 9 A two-dimensional PN layout. This configuration is often called a "NEWS" network, because each PN connects to its north, east, west, and south neighbor. These networks provide more flexible intercommunication than a one-dimensional network, but are more expensive to implement in VLSI and difficult to make work when redundant PNs are used.
Global interconnect permits any PN to talk to any other PN, not just to its immediate neighbors. There are several possible configurations with different levels of performance/cost. At one end of the scale, crossbar interconnect is versatile because it permits random point-to-point communications, but expensive [the cost is O(n²), where n is the number of PNs]. At the other end, broadcast interconnect is cheaper but less flexible. Here, one bus connects all PNs, so any one PN can talk to any other (or set of others) in one clock. On the other hand, it takes n clocks for all PNs to have a turn. The cost is O(1). In between crossbar and broadcast are other configurations that can emulate a crossbar in O(log n) clocks and have cost O(n log n).

Choosing an interconnect structure interacted with other design choices. We decided against using a systolic computing style, in which operands, intermediate results, or both flow down a row of PNs using only local interconnect. Systolic arrays are harder to program. They are also occasionally inefficient because of the clocks needed to fill or empty the pipeline; peak efficiency occurs only when all PNs see all operands. Choosing a systolic array would have permitted us to use local interconnect, saving cost. Deciding against it forced us to provide some form of global interconnect.

Choosing "global" leads to the next choice: what type? The basic computations in our target applications required "one-to-many" or "many-to-many" communication almost exclusively. We therefore decided to use a broadcast bus, which uses only one clock for one-to-many communication. In the many-to-many case, n PNs can talk to all n PNs in n clocks. Broadcast interconnect thus allows n² connections in n clocks. Such O(n²) total connectivity occurs often in our applications. An example is a back-propagation network in which all nodes in one layer connect to all nodes in the next.

Another advantage is that broadcast interconnection is synchronous and fits the synchronous SIMD structure quite well. We were able to use a "slotted" protocol, in which each connection occurs at a known time on the bus. Because the time is known, there is no need to send an address with each data element, saving wires, clocks, or both. Also, the weight address unit can "remember" the slot number and use it to address the weight associated with the connection.

A single broadcast bus is simple, economical to implement, and efficient for the application domain. In fact, if every PN always communicates with every other PN, then broadcast offers the best possible performance/cost.

Broadcast interconnection has several drawbacks. One problem is its inefficiency for some point-to-point communication patterns, in which one PN talks with one other PN anywhere in the array. An example of such a pattern is the "perfect shuffle" used by the fast Fourier transform (FFT, Figure 10). This pattern takes n clocks on the CNAPS broadcast bus and is too slow to be effective. Consequently, CNAPS implements the compute-intensive discrete Fourier transform (DFT) instead of the communication-intensive FFT. The DFT requires O(n²) operations; the FFT, O(n log n).

FIGURE 10 The intercommunication pattern of a fast Fourier transform (FFT). A butterfly intercommunication pattern for four nodes. This pattern is difficult for CNAPS to do in less than N clocks (where N is the number of nodes) with broadcast intercommunication.
If n = p, where p is the number of PNs, then CNAPS can perform a DFT in O(n) clocks. If n > p, then performance can approach the O(n log n) of a sequential processor.

Another problem involves computation localized in a portion of an input vector, where each PN operates on a different (possibly overlapping) subset of the elements. Here, all PNs must wait for all inputs to be broadcast before any computation can begin. A common example of this situation is the limited receptive field structure, often found in image classification and character recognition networks. The convolution operation, also common in image processing, uses similar localized computation. The convolution can proceed rapidly after some portion of the image has been input into each PN, because each PN operates independently on its subset of the image.

When these subfields overlap (such as in convolution), a PN must communicate with its neighbors. To improve performance for such cases, we added a one-dimensional inter-PN pathway, connecting each PN to its right and left neighbors. (One dimension was chosen over two to allow processor redundancy, discussed later.) The CNAPS array therefore has both global (broadcast) and local (inter-PN) interconnection. An example of using the inter-PN pathway might be image processing, where a column of each image is allocated to each PN. The inter-PN pathway permits efficient communication between columns and, consequently, efficient computation for most image-processing algorithms.

A final problem is sparse random interconnect, where each node connects to some random subset of other nodes. Broadcast, from the viewpoint of the connected PNs, is in this case efficient. Nonetheless, when a slotted protocol is used, many PNs are idle because they lack weights connected to the current input and do not need the data being broadcast. Sparse interconnect affects all aspects of the architecture, not just data communication. To improve efficiency for sparsely connected networks, the CNAPS PN offers a special memory technique called virtual zero, which saves memory locations that would otherwise be filled with zeros by not loading zeros into memory for unused connections. The virtual zero technique does not help the idle PN problem, however. Full efficiency with sparse interconnect requires a much more complex architecture, including more individualized control per PN, more complex memory-referencing capabilities, and so on, and is beyond the scope of this chapter.

On-Chip Memory

One of the most difficult decisions was whether to place the local memory on-chip inside the PN or off-chip. Both approaches have advantages and drawbacks; it was a complex decision with no obvious right answer and little opportunity for compromise.

The major advantage of off-chip memory is that it allows essentially unlimited memory per PN. Placing memory inside the PN, in contrast, limits the available memory because memory takes significant silicon area. Increasing PN size also limits the number of PNs. Another advantage to off-chip memory is that it allows the use of relatively low-cost commercial memory chips. On-chip memory, in contrast, increases the cost per bit, even if the memory employs a commercial memory cell.

The major advantage of on-chip memory is that it allows much higher bandwidth for memory access. To see that bandwidth is a crucial factor, consider the following analysis. Recall that each PN has its own data arithmetic units; therefore, each PN requires a unique memory data stream. The CNAPS-1064 has 64 PNs, each potentially requiring up to 2 bytes per clock. At 25 MHz, that is 25M * 64 * 2 = 3.2 billion bytes/sec. Attaining 3.2 billion bytes/sec from off-chip memory is difficult and expensive because of the limits on the number of pins per chip and the data rate per pin.⁷ An option would be to reduce the number of PNs per chip, eroding the benefit of maximum parallelism.

⁷For most implementations, the bit rate per pin is roughly equal to the clock rate, which can vary anywhere from 25 to 100 MHz. There are some special interface protocols which now allow up to 500 Mbits/sec per pin.
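The 3.2 billion bytes/sec figure follows directly from the stated assumptions; a one-line unit check (illustrative C, not CNAPS code):

    #include <stdio.h>

    int main(void)
    {
        /* 64 PNs, 2 bytes per PN per clock, 25-MHz clock */
        double bytes_per_sec = 64.0 * 2.0 * 25e6;
        printf("%.1f billion bytes/sec\n", bytes_per_sec / 1e9);  /* 3.2 */
        return 0;
    }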
Another advantage to on-chip memory is that each PN can address different locations in memory in each clock.
Systems with off-chip memory, in contrast, typically require all PNs to address the same location for each memory reference to reduce the number of external output pins for memory addressing. With a shared address, only a single set of address pins is required for an entire PN array. Allowing each PN to have unique memory addresses requires a set of address pins for each PN, which is expensive. Yet, having each PN address its own local memory improves versatility and speed, because table lookup, string operations, and other kinds of "indirect" reference are possible.

Another advantage is that the total system is simpler. On-chip memory makes it possible to create a complete system with little more than one sequencer chip, one PN array chip, and some external RAM or ROM for the sequencer program. (Program memory needs less bandwidth than PN memory because SIMD machines access it serially, one instruction per clock.)

It is possible to place a cache in each PN, then use off-chip memory as a backing store, which attempts to gain the benefits of both on-chip and off-chip memory by using aspects of both designs. Our simulations on this point verified what most people who work in ANNs already suspected: caching is ineffective for ANNs because of the nonlocality of the memory reference streams. Caches are effective if the processor repeatedly accesses a small set of memory locations, called a working set. Pattern recognition and signal processing programs rarely exhibit that kind of behavior; instead, they reference long, sequential vector arrays.

Separate PN memory addressing also reduces the benefit of caching. Unless all PNs refer to the same address, some PNs can have a cache miss and others not. If the probability of a cache miss is 10% per PN, then a 256-PN array will most likely have a cache miss every clock. But because of the synchronous SIMD control, all PNs must wait for the one or more PNs that miss the cache. This behavior renders the cache useless. A MIMD structure overcomes the problem, but increases system complexity and cost.
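The arithmetic behind that claim: with independent 10% misses, the chance that no PN misses in a given clock is 0.9 raised to the number of PNs, which is vanishingly small. A quick check (illustrative):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double p_all_hit = pow(0.9, 256);           /* ~1.9e-12        */
        printf("P(at least one miss per clock) = %.12f\n", 1.0 - p_all_hit);
        return 0;                                   /* effectively 1.0 */
    }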
As this discussion suggests, local PN memory is a complex topic with no easy answers. Primarily because of the bandwidth needs, and because we had access to a commercial-density static RAM CMOS process, we decided to implement PN memory on chip, inside the PN. Each PN has 4 KB of static RAM in the current 1064 and 1016 chips.

CNAPS is the only architecture for ANN applications we are aware of that uses on-chip memory. Several designs have been proposed that use off-chip memory. The CNS system being developed at Berkeley (Wawrzyneck, Asanovic, & Morgan, 1993), for instance, restricts the number of PNs to 16 per chip. It also uses a special high-speed PN-to-memory bus to achieve the necessary bandwidth. Another system, developed by Ramacher at Siemens (Ramacher et al., 1993), uses a special systolic pipeline that reduces the number of fetches required by forcing each memory fetch to be used several times. This organization is efficient at doing inner products, but has restricted flexibility. HNC has also created a SIMD array called the SNAP (Means & Lisenbee, 1991). It uses floating-point arithmetic, reducing the number of PNs on a chip to only four, in turn reducing the bandwidth requirements.

The major problem with on-chip memory is its limited memory capacity. Although this limitation does restrict CNAPS applications somewhat, it has not been a major problem. With early applications, the performance/cost advantages of on-chip memory have been more important than the memory capacity limits.

Redundancy for Yield Improvement

During the manufacture of integrated circuits, small defects and other anomalies occur, causing some circuits to malfunction. These defects have a more or less random distribution on a silicon wafer. The larger the chip, the greater the probability that at least one defect will occur there during manufacturing. The number of good chips per wafer is called the yield. As chips get larger, fewer chips fit on a wafer and more have defects; therefore, yield drops off rapidly with size. Because wafer costs are fixed, cost per chip is directly related to the number of good chips per wafer.
The result is that bigger chips cost more. On the other hand, bigger chips do more, and their ability to fit more function into a smaller system makes big chips worth more. Semiconductor engineers are constantly pushing the limits to maximize both function and yield at the same time.

One way to build larger chips and maximize yield is to use redundancy, where many copies of a circuit are built into the chip. After fabrication, defective circuits are switched out and replaced with a good copy. Memory designers have used redundancy for years, where extra memory words are fabricated on the chip and substituted for defective words. With redundancy, some defects can be tolerated and still yield a fully functional chip.

One advantage of building ANN silicon is that each PN can be simple and small. In the CNAPS processor array chip, the PNs are small enough to be effective as "units of redundancy." By fabricating spare PNs, we can significantly improve yield and reduce cost per PN. The 1064 has 80 PNs (in an 8 × 10 array), and the 1016 has 20 (4 × 5). Even with a relatively high defect density, the probability of at least 64 out of 80 (or 16 out of 20) PNs being fully functional is close to 1.0. CNAPS is the first commercial processor to make extensive use of such redundancy to reduce costs. Without redundancy, the processor array chips would have been smaller and less cost-effective. We estimate a CNAPS implementation using redundancy has about a two-times performance/cost advantage over one lacking redundancy.
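The "close to 1.0" claim is a binomial tail probability. A small sketch (our code; the per-PN yield used in the example is an assumed value, not a published figure) computes the probability that at least k of n PNs are good:

    #include <math.h>
    #include <stdio.h>

    /* P(X >= k) for X ~ Binomial(n, y), via log-gamma for stability. */
    static double tail(int n, int k, double y)
    {
        double p = 0.0;
        for (int i = k; i <= n; i++)
            p += exp(lgamma(n + 1.0) - lgamma(i + 1.0) - lgamma(n - i + 1.0)
                     + i * log(y) + (n - i) * log(1.0 - y));
        return p;
    }

    int main(void)
    {
        /* e.g., 90% per-PN yield: 64 of 80 good almost always */
        printf("P(>=64 of 80) = %.4f\n", tail(80, 64, 0.90));  /* ~0.999 */
        return 0;
    }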
Redundancy also influenced the decision to use limited-precision, fixed-point arithmetic. Our analyses showed that floating-point PNs would have been too large to leverage redundancy; hence, floating point would have been even more expensive than just the size difference (normally about a factor of four) indicates. Redundancy also influenced the decision to use one-dimensional inter-PN interconnect. One-dimensional interconnect makes it relatively easy to implement PN redundancy, because any 64 of the 80 PNs can be used. Two-dimensional interconnect complicates redundancy and was not essential for our applications. We chose one-dimensional interconnect because it was adequate for our applications and does not impact the PN redundancy mechanisms.

Limitations

In retrospect, we are satisfied with the decisions made in designing the CNAPS architecture. We have no regrets about the major decisions such as the choices of digital, SIMD, limited fixed point, broadcast interconnect, and on-chip memory.

The architecture does have a few minor bottlenecks that will be alleviated in future versions. For example, the 8-bit input/output buses should be 16-bit. In line with that, a true one-clock 16 × 16 multiply is needed, as well as better support for rounding. And future versions will have higher frequencies and more on-chip memory. The one-dimensional inter-PN bus is 2 bits; it should be 16 bits. Despite these few limitations, the architecture has been successfully applied to several applications with excellent performance.

Product Realization and Software

Adaptive Solutions has created a complete development software package for CNAPS. It includes a library of important ANN algorithms and a C compiler with a library of commonly used functions.⁸ Several board products are now available and sold to customers to use for ANN emulation, image and signal processing, and pattern recognition applications.

⁸CNAPS-C is a data-parallel version of the standard C language.

CNAPS APPLICATIONS

This section reviews several CNAPS applications. Because of the nature of this book, its focus is on ANN applications, although CNAPS has also been used for non-ANN applications such as image processing. Some applications mix ANN and non-ANN techniques. For example, an application could preprocess and enhance an image via standard imaging algorithms,
then use an ANN classifier on segments of the image, keeping all data inside the CNAPS array for all operations.⁹ A discussion of the full range of CNAPS's capabilities is beyond the scope of this paper. For a detailed discussion of CNAPS in signal processing, see Skinner, 1994.

⁹To change algorithms, the CSC need only branch to a different section of a program.

Back-Propagation

The most popular ANN algorithm is back-propagation (BP; Rumelhart & McClelland, 1986). Although it requires large computational resources during training, BP has several advantages that make it a valuable algorithm:

• it is reasonably generic, meaning that one network model (emulation program) can be applied to a wide range of applications with little or no modification;
• its nonlinear, multilayer architecture lets it solve complex problems;
• it is relatively easy to use and understand; and
• several commercial software vendors have excellent BP implementations.

It is estimated that more than 90% of the ANN applications in use today use BP or some variant of it. We therefore felt that it was important for CNAPS to execute BP efficiently. This section briefly discusses the general implementation of BP on CNAPS. For more detail, see McCartor (1991).

There are two CNAPS implementations of BP: a single-precision version (BP16) and a double-precision version (BP32). BP16 uses unsigned 8-bit input and output values and signed 16-bit weights. The activation function is a traditional sigmoid, implemented by table lookup. BP32 uses signed 16-bit input and output values and signed 32-bit weights. The activation function is a hyperbolic tangent implemented by table lookup for the upper 8 bits and by linear extrapolation for the lower 8 bits. All values are fixed point. We have found that BP16 is sufficient for all classification problems. BP16 has also been sufficient for most curve-fitting problems, such as function prediction, which have more stringent accuracy requirements. In those cases in which BP16 does not have the accuracy of floating point, BP32 is as accurate as floating point in all cases studied so far. The rest of this section focuses on the BP16 algorithm. It does not discuss the techniques involved in dealing with limited precision on CNAPS.
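The description of the BP32 activation translates directly into code. The sketch below is our reconstruction; the table contents, scaling, and the use of adjacent-entry interpolation for the lower byte are assumptions, not the shipped implementation.

    /* Activation by table plus linear correction: the upper 8 bits of
     * the 16-bit input select a table entry; the lower 8 bits scale the
     * step to the next entry. act_table[] would hold sampled tanh values. */
    short act_table[257];                 /* 256 segments + end point */

    short activate16(unsigned short x)
    {
        int hi = x >> 8;                  /* table index (upper byte) */
        int lo = x & 0xff;                /* fraction    (lower byte) */
        int y0 = act_table[hi];
        int y1 = act_table[hi + 1];
        return (short)(y0 + (((y1 - y0) * lo) >> 8));
    }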
Back-propagation has two phases. The first is feed-forward operation, in which the network passes data without updating weights. The second is error back-propagation and weight update during training. Each phase will be discussed separately. This discussion assumes that the reader already has a working understanding of BP.

Back-Propagation: Feed-Forward Phase

Assume a simple CNAPS system with four PNs and a BP network with five inputs, four hidden nodes, and two output nodes (34 total connections, counting a separate bias parameter for each node; Figure 11).

FIGURE 11 A back-propagation network with five inputs, four hidden nodes, and two output nodes.
350 Dan Hammerstrom

Allocate nodes 0 and 4 to PNO, nodes 1 and 5 to PN 1, have computed all products. One more clock is needed
node 2 to PN2, and node 3 to PN3. When a node is for the last addition; then, a sigmoid table lookup is
allocated to a PN, the local memory of that PN is performed. Finally, the node 4 and 5 outputs are trans-
loaded with the weight values for each of the node’s mitted sequentially on the Outbus, and the CSC writes
connections and with the lookup table for the sig- them into a file.
moid function. If learning is to be performed, then Let a connection clock be the time it takes to com-
each connection requires a 2-byte weight plus 2 bytes pute one connection. For standard BP a connection
to sum the weight deltas, and a 2-byte transpose requires a multiply-accumulate plus, depending on the
weight (discussed below). This network then requires architecture, a memory fetch of the next weight, the
204 bytes for connection information. Using momen- computation of that weight’s address, and so on. For
tum-ignored here for simplicity-would require the CNAPS PN, a connection clock takes one cycle.
more bytes per connection. On a commercial microprocessor chip, a connection
Each input vector contains five elements. The clock can require one or more cycles, because many
weight index notation is WA,,, where A is the layer commercial chips cannot simultaneously execute all
index, in our example, 0 for input, 1 for hidden, and 2 operations required to compute a connection clock:
for output. B indexes the node in the layer, and C in- weight fetch, weight address increment, input element
dexes the weight in the node. To start the emulation fetch, multiply, and accumulate. These operations can
process, each element of the input vector is read from take up to 10 clocks on many microprocessors. Much
an external file by the CSC and broadcast over the of this overhead is memory fetching, because many
Inbus to all four PNs. PNO performs the multiply v0 * state-of-the-art microprocessors are making more use
~1,; PNl, v0 * wl,,; and so on. This happens in one of several levels of intermediate data caching. How-
clock. In the next clock, v, is broadcast, PNO computes ever, as discussed previously, ANNs are notorious
VI * wl,,; PNl, v, * WI,,; and so on. Meanwhile, the cache busters, so many memory and input element
previous clock’s products are sent to the adder, which fetches can take several clocks each.
initially contains zero. Simulating a three-layer BP network with.f’. inputs,
All of the hidden-layer products have been gener- NH nodes in the hidden layer, and N, nodes in the
ated after five clocks. One more clock is required to output layer will require (N, * NH) * (N, * _\.,) + N,
add the last product to the accumulating sum (ignoring connection clocks for nonlearning, feed-foru-ard op-
the bias terms here for simplicity). Next, all PNs ex- eration on a single-processor system. On CS.\PS. as-
tract the most significant byte out of the product and suming there are more PNs than hidden or output
use it as an address into the lookup table to get the nodes, the same network will require N, + _\e.q + N,
sigmoid output. The read value then is put into the out- connection clocks. For example, assume that N, =
put buffer, and the PNs are ready to compute the output 256, NH = 128, and N, = 64. For a single processor
node outputs. system, the total is 73,792 connection clocks: for
The next step is computing the output-layer node CNAPS, 448. If a workstation takes about four cycles
values (nodes 4 and 5). In the first clock, PNO trans- on average per connection, which is typical. to com-
mits its output (node O’s output) onto the output bus. pute a connection, then CNAPS is about 600X faster
This value goes through the CSC and comes out on the on this network.
Inbus, where it is broadcast to all PNs. Although only
IWO and PNl are used, all PNs compute values (PN2
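Stripped of the clock-level detail, the whole feed-forward pass of this small example collapses into two loops. The following plain C sketch mirrors the PN computation; the data layout, names, and the choice of which accumulator byte indexes the table are our assumptions, and biases are ignored as in the text.

    #define NIN  5                      /* inputs                     */
    #define NHID 4                      /* hidden nodes (PN0..PN3)    */
    #define NOUT 2                      /* output nodes (on PN0, PN1) */

    short w1[NHID][NIN];                /* hidden-layer weight rows   */
    short w2[NOUT][NHID];               /* output-layer weight rows   */
    unsigned char sig[256];             /* per-PN sigmoid table       */

    void feed_forward(const unsigned char v[NIN], unsigned char y[NOUT])
    {
        unsigned char h[NHID];
        for (int p = 0; p < NHID; p++) {            /* hidden layer  */
            long acc = 0;
            for (int i = 0; i < NIN; i++)
                acc += (long)w1[p][i] * v[i];       /* 5 clocks      */
            h[p] = sig[(acc >> 16) & 0xff];         /* MSB lookup    */
        }
        for (int p = 0; p < NOUT; p++) {            /* output layer  */
            long acc = 0;
            for (int i = 0; i < NHID; i++)
                acc += (long)w2[p][i] * h[i];       /* 4 clocks      */
            y[p] = sig[(acc >> 16) & 0xff];
        }
    }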
Back-Propagation: Learning Phase

The second and more complex aspect of BP learning is computing the weight delta for each connection. A detailed discussion of this computation and its CNAPS implementation is beyond the scope of this chapter, so only a brief overview is given here.
The computation is more or less the same as a sequential implementation. The basic learning operation in BP is to compute an error signal for each node. The error signal is proportional to that node's contribution to the output error (the difference between the target output vector and the actual output vector). From the error signal, a node can then compute how to update its weights. At the output layer, the error signal is the difference between the feed-forward output vector and the target output vector for that training vector. The output nodes can compute their error signals in parallel.

The next step is to compute the delta for each output node's input weight (the hidden-to-output weights). This computation can be done in parallel, with each node computing, sequentially, the deltas for all weights of the output node on this PN. If a batching algorithm is used, then the deltas are added to a data element associated with each weight. After several weight updates have been computed, the weights are updated according to an accumulated delta.

The next step is to compute the error signals for the hidden-layer nodes, which requires a multiply-accumulate of the output-node error signals through the output-node weights. Unfortunately, the output-layer weights are in the wrong place (on the output PNs) for computing the hidden-layer errors; that is, the hidden nodes need weights that are scattered among the output PNs, which can best be represented as a transpose of the weight matrix for that layer. In other words, a row of the forward weight matrix is allocated to each PN. When propagating the error back to the hidden layer, the inner product uses the column of the same matrix, which is spread across PNs. A transpose of the weight matrix makes these columns into rows and allows efficient matrix-vector operations. A transpose operation is slow on CNAPS, taking O(n²) operations. The easiest solution was to maintain two weight matrices for each layer, the feed-forward version and a transposed version for the error back-propagation. This requires twice the weight memory for each hidden node, but permits error propagation to be parallel, not serial. Although the new weight value need only be computed once, it must be written to two places. This duplicate transpose weight matrix is required only if learning is to be performed.
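In code, the fix amounts to writing every weight twice. A sketch of the dual store follows (our data layout, not the actual BP16 source): every update lands in the forward matrix and in its transpose, so both the feed-forward pass and the error propagation read PN-local rows.

    #define NH 4                       /* hidden nodes */
    #define NO 2                       /* output nodes */
    short w_fwd[NO][NH];               /* row o local to output PN o  */
    short w_t[NH][NO];                 /* row h local to hidden PN h  */

    void apply_delta(int o, int h, short delta)
    {
        w_fwd[o][h] += delta;          /* used in the forward pass    */
        w_t[h][o]   += delta;          /* used to back-propagate error */
    }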
words. a row of the forward weight matrix is allo- Using spatial (pixel) filters to enhance an image re-
cated to each PN. When propagating the error back to quires more complex computations than simple pixel
the hidden layer, the inner product uses the column operations require. Convolution, for example, is a
of the same matrix which is spread across PNs. A common operation performed during feature extrac-
transpose of the weight matrix makes these columns tion to filter noise or define edges. Here, a kernel, an
into rows and allows efficient matrix-vector opera- M by M dimensional matrix, is convolved over an im-
tions, A transpose operation is slow on CNAPS, tak- age. In the following equation, for instance, the local
ing 0(h3) operations. The easiest solution was to kernel k is convolved over an N by N image a to pro-
maintain two weight matrices for each layer, the feed- duce a filtered N by N image b:
forward version and a transposed version for the er-
6, = CP.4 kp.qar - *., - q (2)
ror back-propagation. This requires twice the weight
memory for each hidden node, but permits error prop-
(i I i,j i N)(l ‘p,q 5 M)
agation to be parallel, not serial. Although the new
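For reference, a direct serial rendering of Equation (2) in C might look as follows. Indices are zero-based rather than the 1-to-N of the equation, and pixels whose support falls outside the image are treated as zero; the chapter does not specify an edge policy, so that padding choice is an assumption.

/* b[i][j] = sum over p,q of k[p][q] * a[i-p][j-q]  (Equation 2) */
void convolve(int N, int M, const float a[N][N],
              const float k[M][M], float b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int p = 0; p < M; p++)
                for (int q = 0; q < M; q++) {
                    int y = i - p, x = j - q;   /* flipped-kernel indexing */
                    if (y >= 0 && y < N && x >= 0 && x < N)
                        acc += k[p][q] * a[y][x];
                }
            b[i][j] = acc;   /* an M*M-term inner product per pixel */
        }
}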
Typical convolution kernels are Gaussian, difference-of-Gaussian, and Laplacian filters. Because of their inherent parallelism, convolution algorithms can be easily mapped to the CNAPS architecture. The image to be filtered is divided into regions or "tiles," and each region is then subdivided into columns of pixel data. The CNAPS array processes the image one row at a time. Pixels from adjacent columns are transferred between neighboring PNs through the inter-PN bus. A series of (M - 1)/2 transfers in each direction is made so that each PN can store all the image data needed for the local calculation. Once the PN has in local memory all the pixels in the "support" for the convolution being computed, the kernel, k, is broadcast simultaneously to all PNs. This kernel can come from external data memory, or be sent sequentially from M PNs. The actual computation is just our familiar inner product.

Because of the parallel structure of this algorithm, all PNs can calculate the convolution kernel at the same time, convolving all pixels in one row simultaneously. Using different kernels, this convolution process can be carried out several times, each time with a different type of spatial filtering performed on the image.

For a 512 × 512 image and 512 PNs (one column allocated per PN), a 3 × 3 kernel can be convolved over all pixels in 1.6 msec, assuming the image is already loaded. A 7 × 7 kernel requires 9.6 msec.
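The column-per-PN mapping can be sketched as follows. The outer loop over p stands in for what the hardware does in a single lockstep pass, and the reads of neighboring columns stand in for the (M - 1)/2 inter-PN transfers in each direction. Sizes and names are illustrative, and the kernel is applied in unflipped (correlation) form purely for readability.

#define NPN   512   /* one PN per image column */
#define NROWS 512

/* One output row of the filtered image, simulated serially. */
void conv_row(int M, int row,
              const float col[NPN][NROWS],  /* col[p] lives in PN p's memory */
              const float k[7][7],          /* kernel buffer; M <= 7 assumed */
              float out[NPN])
{
    int h = (M - 1) / 2;
    for (int p = 0; p < NPN; p++) {          /* in hardware: all PNs at once */
        float acc = 0.0f;
        for (int dq = -h; dq <= h; dq++) {   /* columns from neighboring PNs */
            int q = p + dq;
            if (q < 0 || q >= NPN) continue; /* image border                 */
            for (int dp = -h; dp <= h; dp++) {
                int r = row + dp;            /* rows within each column      */
                if (r >= 0 && r < NROWS)
                    acc += k[dp + h][dq + h] * col[q][r];
            }
        }
        out[p] = acc;   /* one filtered pixel per PN for this row */
    }
}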
Naval Air Warfare Center

At the Naval Air Warfare Center (NAWC) at China Lake, California, ANN technology has been aimed at air-launched tactical missiles. Processing sensor information on board these missiles demands a computational density (operations per second per cubic inch) far above most commercial applications. Tactical missiles typically have several high-data-rate sensors, each with its own separate requirements for high-speed processing. The separate data must then be fused, and the physical operation of the missile controlled. All this must be done under millisecond or microsecond time constraints and in a volume of a few cubic inches. Available power is measured in tens of watts. Such immense demands have driven NAWC researchers toward ANN technology.

For some time (1986 to 1991), many believed that analog hardware was the only way to achieve the required computational density. The emergence of wafer-scale, parallel digital processing (exemplified by the CNAPS chip) has changed that assessment, however. With this chip, we have crossed the threshold at which digital hardware, with all its attendant flexibility advantages, has the computational density needed to be useful in the tactical missile environment. Analog VLSI may still be the only way to overcome some of the most acute time-critical processing problems on board the missile, for example, at the front end of an image-processing system. A hybrid system combining the best of both types of chips may easily turn out to be the best solution.

Researchers at NAWC have worked with several versions of the CNAPS system. They have easily implemented cortico-morphic computational structures on this system, structures that were difficult or impossible under the analog constraints of previous systems. They have also worked with Adaptive Solutions to design and implement a multiple-controller CNAPS system (a multiple SIMD architecture, or MSIMD) with high-speed data-transfer paths between the subsystems, and they are completing the design and fabrication of a real-time system interfaced to actual missile hardware. The current iteration will be of the SIMD form, but the follow-on will have the new MSIMD structure.

Because of the nature of the work at NAWC, specific results cannot be discussed here. Some general ideas merit mention, however. Standard image-processing techniques typically deal only with spatial detail, examining a single frame of the image in discrete time. One advantage of the cortico-morphic techniques developed by NAWC is that they incorporate the temporal aspects of the signal into the classification process. In target tracking and recognition applications, temporal information is at least as important as spatial information.
The cortico-morphic processing paradigm, as implemented on the CNAPS architecture, allows sequential processing of patches of data in real time, similar to the processing in the vertebrate retina and cortex.

One important near-term application of this computational structure is in the area of adaptive nonuniformity compensation for staring focal plane arrays. It appears also that this structure will allow the implementation of three-dimensional wavelet transforms, where the third dimension is time.

Lynch/Granger Pyriform Implementation

Researchers Gary Lynch and Richard Granger (Granger et al., this volume) at the University of California, Irvine, have produced an ANN model based on their studies of the pyriform cortex of the rat. The algorithm contains features abstracted from actual biological operations, and has been implemented on the CNAPS parallel computer (Means & Hammerstrom, 1991).

The algorithm contains both parallel and serial elements, and lends itself well to execution on CNAPS. Clusters of competing neurons, called patches or subnets, hierarchically classify inputs by first competing for the greatest activation within each patch, then subtracting the most prominent features from the input as it proceeds down the lateral olfactory tract (LOT, the primary input channel) to subsequent patches. Patch activation and competition occur in parallel in the CNAPS implementation. A renormalization function analogous to the automatic gain control performed in pyriform cortex also occurs in parallel across competing PNs in the CNAPS array.

Transmission of LOT input from patch to patch is an inherently serial element of the pyriform model, so opportunities for parallel execution for this part of the model are few. Nevertheless, overall speedups for execution on CNAPS (compared to execution on a serial machine) of 50 to 200 times are possible, depending on network dimensions.

Refinements of the pyriform model and applications of it to diverse pattern recognition problems continue.
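The patch dynamics just described can be summarized in a loose serial sketch. This is not the Lynch/Granger code: the input width, patch capacity, and the clipping that stands in for the model's renormalization are all assumptions, and only the competition-then-subtraction flow is taken from the text above.

#define D 64   /* LOT input width (assumed) */

typedef struct {
    int   n_nodes;        /* nodes competing within this patch    */
    float weight[16][D];  /* one feature vector per node (<= 16)  */
} Patch;

/* Winner-take-all within one patch; in the CNAPS implementation the
   activations are computed in parallel across PNs. */
static int patch_compete(const Patch *p, const float lot[D])
{
    int best = 0;
    float best_act = -1e30f;
    for (int n = 0; n < p->n_nodes; n++) {
        float act = 0.0f;
        for (int d = 0; d < D; d++)
            act += p->weight[n][d] * lot[d];
        if (act > best_act) { best_act = act; best = n; }
    }
    return best;
}

/* Patch-to-patch transmission is inherently serial, as noted above. */
void pyriform_pass(const Patch patches[], int n_patches,
                   float lot[D], int winners[])
{
    for (int i = 0; i < n_patches; i++) {
        int win = patch_compete(&patches[i], lot);
        winners[i] = win;
        for (int d = 0; d < D; d++) {            /* subtract the winner's   */
            lot[d] -= patches[i].weight[win][d]; /* feature from the input  */
            if (lot[d] < 0.0f) lot[d] = 0.0f;    /* crude stand-in for the  */
        }                                        /* model's renormalization */
    }
}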
Sharp Kanji

Another application that has successfully used ANNs and the CNAPS system is a Kanji optical character recognition (OCR) system developed by the Sharp Corporation of Japan. In OCR, a page of printed text is scanned to produce a bit pattern of the entire image. The OCR program's task is to convert the bit pattern of each character into a computer representation of the character. In the United States and Europe, the most common representation of Latin characters is the 8-bit ASCII code. In Japan, because of their unique writing system, it is the 16-bit JIS code.

The OCR system requires a complex set of image recognition operations. Many companies have found that ANNs are effective for OCR because ANNs are powerful classifiers. Many commercial OCR companies, such as Caere, Calera, Expervision, and Mimetics, use ANN classifiers as a part of their software.

Japanese OCR is much more difficult than English OCR because Japanese has a larger character set. Written Japanese has two basic alphabets. The first is Kanji, or pictorial characters borrowed from China. Japanese has tens of thousands of Kanji characters, although it is possible to manage reasonably well with about 3500 characters. Sharp chose these basic Kanji characters for their recognizer.

The second alphabet is Kana, composed of two phonetic alphabets (hiragana and katakana) having 53 characters each. Typical written Japanese mixes Kanji and Kana. Written Japanese also employs arabic numerals and Latin characters, also found in business and newspaper writing. A commercial OCR system must be able to identify all four types of characters. To add further complexity, any character can appear in several different fonts.

Japanese keyboards are difficult to use, so a much smaller proportion of business documentation than one sees in the United States and other western countries is in a computer-readable form. This difficulty creates a great demand for the ability to read accurately printed Japanese text and to convert it to the corresponding JIS code automatically.
Unfortunately, because of the large alphabet, computer recognition of written Japanese is a daunting task. At the time this chapter is being written, the commercial market consists of slow (10-50 characters/sec), expensive (tens of thousands of dollars), and marginally accurate (96%) systems. Providing high speed and accuracy for a reasonable price would be a quantum leap in capability in the current market.

Sharp Corporation and Mitsubishi Electric Corporation have both built prototype Japanese recognition systems based on the CNAPS architecture. Both systems recognize a total of about 4000 characters in 15 or more different fonts at accuracies of more than 99% and speeds of several hundred characters per second. These applications have not yet been released as commercial products, but both companies have announced intentions to do so.

Sharp's system uses a hierarchical three-layer network (Hammerstrom, 1993; Togawa, Ueda, Aramaki, & Tanaka, 1991; Figures 12 and 13). Each layer is based on Kohonen's Learning Vector Quantization (LVQ), a Bayesian approximation algorithm that shifts the node boundaries to maximize the number of correct classifications. In Sharp's system, unlike back-propagation, each hidden-layer node represents a character class, and some classes are assigned to several nodes. Ambiguous characters pass to the next layer. When any layer unambiguously classifies a character, it has been identified, and the system moves on to the next character.

The first two levels take as input a 16 × 16 pixel image (256 elements) (Figure 12). With some exceptions, these layers classify the character into multiple subcategories. The third level has a separate network per subcategory (Figure 13). It uses a high-resolution 32 × 32 pixel image (1024 elements), focusing on the subareas of the image known to have the greatest differences among characters belonging to the subcategory. These subareas of the image are trained to tolerate reasonable spatial shifting without sacrificing accuracy. Such shift tolerance is essential because of the differences among fonts and shifting during scanning.

Sharp's engineers clustered 3303 characters into 893 subcategories containing similar characters. The use of subcategories let Sharp build and train several small networks instead of one large network. Each small network took its input from several local receptive fields designed to look for particular features. The locations of these fields were chosen automatically during training to maximize discriminative information. The target features are applied to several positions within each receptive field, enhancing the shift tolerance of the field.

On a database of scanned characters that included more than 26 fonts, Sharp reported an accuracy of 99.92% on the 13 fonts used for training and 99.01% on the 13 fonts used for testing. These results show the generalization capabilities of this network.
FIGURE 12 A schematicized version of the three-layer LVQ network that Sharp uses in their Kanji OCR system. The character is presented as a 16 × 16, or 256-element, image. Some characters are recognized immediately; others are merely grouped with similar characters.
FIGURE 13 Distinguishing members of a group by focusing on a group-specific subfield. Here, a more detailed 32 × 32 image is used (Togawa et al., 1991).
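A minimal sketch of one classification stage is given below; the names, the squared-Euclidean distance, and the margin test are assumptions, since the chapter does not give Sharp's actual decision rule. A stage answers with a character class only when its nearest prototype wins clearly; otherwise the character is passed on (returned here as -1).

#include <float.h>

typedef struct {
    int          n_nodes, dim;
    const float *proto;    /* n_nodes x dim prototype matrix         */
    const int   *label;    /* character class of each node           */
    float        margin;   /* required winner/runner-up distance gap */
} LvqStage;

int lvq_classify(const LvqStage *s, const float *x)
{
    float best = FLT_MAX, second = FLT_MAX;
    int best_lab = -1;
    for (int n = 0; n < s->n_nodes; n++) {
        float d2 = 0.0f;
        for (int i = 0; i < s->dim; i++) {
            float diff = x[i] - s->proto[n * s->dim + i];
            d2 += diff * diff;        /* squared Euclidean distance */
        }
        if (d2 < best)        { second = best; best = d2; best_lab = s->label[n]; }
        else if (d2 < second) { second = d2; }
    }
    /* Unambiguous only if the winner beats the runner-up clearly. */
    return (second - best > s->margin) ? best_lab : -1;
}

A driver would run stage 1 on the 16 × 16 image, fall through to stage 2 on a -1, and finally hand the 32 × 32 image to the appropriate subcategory network.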

NON-CNAPS APPLICATIONS

This section discusses two applications that do not use CNAPS (although they could easily use the CNAPS BP implementation).

Nippon Steel

ANNs are starting to make a difference in process control for manufacturing. In many commercial environments, controlling a complex process can be beyond the best adaptive control systems or rule-based expert systems. One reason for this is that many natural processes are strongly nonlinear. Most adaptive control theory, on the other hand, assumes linearity. Furthermore, many processes are so complex that there is no concise mathematical description of the process, just large amounts of data.

Working with such data is the province of ANNs, because they have been shown to extract, from data alone, accurate descriptions of highly complex, nonlinear processes. After the network describes the process, it can be used to help control it. Another technique is to use two networks, where one models the process to be controlled and the other the inverse control model. An inverse network takes as input the desired state and returns the control values that place the process in that state.

There are many examples of using ANNs for industrial process control. This section describes an application in the steel industry, developed jointly by Fujitsu Ltd., Kawasaki, and Nippon Steel, Kitakyushu-shi, Japan.
The technique is more effective than any previous technique and has reduced costs by several million dollars a year.

This system controls a steel production process called continuous casting. In this process, molten steel is poured into one end of a special mold, where the molded surface hardens into a solid shell around the molten center. Then, the partially cooled steel is pulled out the other end of the mold. Everything works fine unless the solid shell breaks, spilling molten steel and halting the process. This "breakout" appears to be caused by abnormal temperature gradients in the mold, which develop when the shell tears inside the mold. The tear propagates down the mold toward a second opening. When the tear reaches the open end, a breakout occurs. Because a tear allows molten metal to touch the surface of the mold, an incipient breakout is a moving hot spot on the mold. Such tears can be spotted by strategically placing temperature-sensing devices on the mold. Unfortunately, temperature fluctuation on the mold makes it difficult to find the hot spot associated with a tear. Fujitsu and Nippon Steel developed an ANN application that recognizes breakout almost perfectly. It has two sets of networks: the first set looks for certain hot spot shapes; the second, for motion. Both were developed using the back-propagation algorithm.

The first type of network is trained to find a particular temperature rise and fall between the input and output of the mold. Each sensor is sampled 10 times, providing 10 time-shifted inputs for each network forward pass. These networks identify potential breakout profiles. The second type of network is trained on adjacent pairs of mold input sensors. These data are sampled and shifted in six steps, providing six time-shifted inputs to each network. The output indicates whether adjacent sensors detect the breakout temperature profile. The final output is passed to the process-control software which, if breakout conditions are signalled, slows the rate of steel flow out of the mold.

Training was done on data from 34 events including nine breakouts. Testing was on another 27 events including two breakouts. The system worked perfectly, detecting breakouts 6.5 sec earlier than a previous control system developed at considerable expense. The new system has been in actual operation at Nippon Steel's Yawata works and has been almost 100% accurate.
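The time-shifted inputs described above suggest a simple sliding-window sketch; the buffering and names below are assumptions, with only the window sizes (10 samples for the shape networks, six for the pair networks) taken from the text.

#define HIST 16   /* ring buffer of recent samples per sensor */

typedef struct { float h[HIST]; int head; } SensorLog;

void sensor_push(SensorLog *s, float sample)
{
    s->head = (s->head + 1) % HIST;
    s->h[s->head] = sample;
}

/* Copy the n most recent samples, newest first, into a network input. */
void time_shifted_inputs(const SensorLog *s, int n, float in[])
{
    for (int t = 0; t < n; t++)
        in[t] = s->h[(s->head - t + HIST) % HIST];
}

/* Pair network input: six time-shifted samples from each of two
   adjacent sensors, concatenated into a 12-element vector. */
void build_pair_input(const SensorLog *a, const SensorLog *b, float in[12])
{
    time_shifted_inputs(a, 6, in);
    time_shifted_inputs(b, 6, in + 6);
}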
Financial Analysis

ANNs can do nonlinear curve fitting on the basis of the data points used to train the networks. This characteristic can be used to model natural or synthetic processes and then to control them by predicting future values or states. Manufacturing processes such as the steel manufacturing described earlier are excellent examples of such processes. Financial decisions also can benefit from modeling complex, nonlinear processes to predict future values.

Financial commodities markets, for example bonds, stocks, and currency exchange, can be viewed as complex processes. Granted, these processes are noisy and highly nonlinear. Making a profit by predicting currency exchange rates or the price of a stock does not require perfect accuracy, however. Accounting for all of the statistical variance is unneeded. What is needed is only doing better than other people or systems.

Researchers in mathematical modeling of financial transactions are finding that ANN models are powerful estimators of these processes. Their results are so good that most practitioners have become secretive about their work. It is therefore difficult to get accurate information about how much research is being done in this area, or about the quality of results. One academic group publishing some results is affiliated with the London Business School and University College London, where Professor A. N. Refenes (1993) has established the NeuroForecasting Centre. The Centre has attracted more than £1.2 million in funding from the British Department of Trade and Industry, Citicorp, Barclays-BZW, the Mars Corp., and several pension funds.

Under Professor Refenes's direction, several ANN-based financial decision systems have been created for computer-assisted trading in foreign exchange, stock and bond valuation, commodity price prediction, and global capital markets.
These systems have shown better performance than traditional automatic systems. One network, trained to select trading strategies, earned an average annual profit of 18%. A traditional system earned only 12.3%.

As with all ANN systems, the more you know about the environment you are modeling, the simpler the network, and the better it will perform. One system developed at the NeuroForecasting Centre models international bond markets to predict when capital should be allocated between bonds and cash. The system models seven countries, with one network for each (Figure 14). Each network predicts the bond returns for that country one month ahead. All seven predictions for each month are then presented to a software-based portfolio management system. This system allocates capital to the markets with the best predicted results, simultaneously minimizing risk.

Each country network was trained with historical bond market data for that country between the years 1971 and 1988. The inputs are four to eight parameters, such as oil prices, interest rates, precious metal prices, and so on. Network output is the bond return for the next month. According to Refenes, this system returned 125% between 1989 and 1992; a more conventional system earned only 34%. This improvement represents a significant return in the financial domain. This system has actually been used to trade a real investment of $10 million, earning 2.4% above a standard benchmark in November and December of that year.

FIGURE 14 A simple neural network-based financial analyzer. This network consists of seven simple subnetworks, each trained to predict bond futures in its respective market. An allocation expert system is used to allocate a fixed amount of cash to each market (Refenes, 1993).
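As a toy rendering of the structure in Figure 14 (the names, the stub predictor, and the winner-take-all allocation are invented for illustration; the actual system uses a portfolio manager that also minimizes risk):

#define N_COUNTRY 7

/* Stand-in for one trained country network; the real predictor is a
   trained ANN over four to eight economic indicators. */
static float predict_return(int country, const float indicators[8])
{
    (void)country;
    return indicators[0];   /* placeholder only */
}

/* Put the capital in the market with the best predicted one-month
   return, or hold cash if no market beats zero. */
void allocate(const float indicators[N_COUNTRY][8], float capital,
              float alloc[N_COUNTRY], float *cash)
{
    int best = 0;
    float r[N_COUNTRY];
    for (int c = 0; c < N_COUNTRY; c++) {
        r[c] = predict_return(c, indicators[c]);
        alloc[c] = 0.0f;
        if (r[c] > r[best]) best = c;
    }
    if (r[best] > 0.0f) { alloc[best] = capital; *cash = 0.0f; }
    else                { *cash = capital; }
}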
CONCLUSION

This chapter has given only a brief view into the CNAPS product and into the decisions made during its design. It has also briefly examined some real applications that use this product. The reader should have a better idea about why the various design decisions were made during this process and the final outcome of this effort. The CNAPS system has achieved its goals in speed and performance and, as discussed, is finding its way into real-world applications.

Acknowledgments

I would like to acknowledge, first and foremost, Adaptive Solutions' investors for their foresight and patience in financing the development of the CNAPS system. They are the unsung heroes of this entire effort. I would also like to acknowledge the following people and their contributions to the chapter: Dr. Dave Andes of the Naval Air Warfare Center, and Eric Means and Steve Pendleton of Adaptive Solutions.

The Office of Naval Research funded the development of the implementation of the Lynch/Granger model on the CNAPS system through Contracts No. N00014-88-K-0329 and No. N00014-90-J-1349.

References

Akers, L. A., Haghighi, S., & Rao, A. (1990). VLSI implementations of sensory processing systems. In Proceedings of the Neural Networks for Sensory and Motor Systems (NSMS) Workshop, March.
Alspector, J. (1991). Experimental evaluation of learning in a neural microsystem. In Advances in Neural Information Processing Systems III. San Mateo, CA: Morgan Kaufmann.
Baker, T., & Hammerstrom, D. (1989). Characterization of artificial neural network algorithms. In 1989 International IEEE Symposium on Circuits and Systems, pp. 78-81, September.
Graf, H. P., Jackel, L. D., & Hubbard, W. E. (1988). VLSI implementation of a neural network model. IEEE Computer, 21(3), 41-49.
Griffin, M., et al. (1990). An 11 million transistor digital neural network execution engine. In The IEEE International Solid State Circuits Conference.
Hammerstrom, D. (1990). A VLSI architecture for high-performance, low-cost, on-chip learning. In Proceedings of the IJCNN.
Hammerstrom, D. (1991). A highly parallel digital architecture for neural network emulation. In J. G. Delgado-Frias & W. R. Moore (Eds.), VLSI for artificial intelligence and neural networks. New York: Plenum Press.
Hammerstrom, D. (1993a). Neural networks at work. IEEE Spectrum, pp. 26-32, June.
Hammerstrom, D. (1993b). Working with neural networks. IEEE Spectrum, pp. 46-53, July.
Hoehfeld, M., & Fahlman, S. E. (1992). Learning with limited numerical precision using the cascade-correlation algorithm. IEEE Transactions on Neural Networks, 3(4), July.
Holler, M., Tam, S., Castro, H., & Benson, R. (1989). An electrically trainable artificial neural network (eTANN) with 10,240 floating gate synapses. In Proceedings of the IJCNN.
McCartor, H. (1991). Back-propagation implementations on the Adaptive Solutions neurocomputer chip. In Advances in Neural Information Processing Systems II, Denver, CO. San Mateo, CA: Morgan Kaufmann.
Mead, C. (1989). Analog VLSI and neural systems. New York: Addison-Wesley.
Means, E., & Hammerstrom, D. (1991). Piriform model execution on a neurocomputer. In Proceedings of the IJCNN.
Means, R. W., & Lisenbee, L. (1991). Extensible linear floating point SIMD neurocomputer array processor. In Proceedings of the IJCNN.
Morgan, N. (Ed.). (1990). Artificial neural networks: Electronic implementations. Washington, DC: Computer Society Press of the IEEE, Computer Society Press Technology Series.
Ramacher, U. (1993). Multiprocessor and memory architecture of the neurocomputer SYNAPSE-1. In Proceedings of the World Congress of Neural Networks.
Refenes, A. N. (1993). Financial modeling using neural networks. In Commercial applications of parallel computing. UNICOM.
Rumelhart, D., & McClelland, J. (1986). Parallel distributed processing. Cambridge, MA: MIT Press.
Sejnowski, T., & Rosenberg, C. (1986). NetTalk: A parallel network that learns to read aloud. Technical Report JHU/EECS-86/01, The Johns Hopkins University Electrical Engineering and Computer Science Department.
Shoemaker, P. A., Carlin, M. J., & Shimabukuro, R. L. (1990). Back-propagation learning with coarse quantization of weight updates. In Proceedings of the IJCNN.
Skinner, T. (1994). Harness multiprocessing power for DSP systems. Electronic Design, February 7.
Togawa, F., Ueda, T., Aramaki, T., & Tanaka, A. (1991). Receptive field neural networks with shift tolerant capability for Kanji character recognition. In Proceedings of the International Joint Conference on Neural Networks, June.
Wawrzynek, J., Asanovic, K., & Morgan, N. (1993). The design of a neuro-microprocessor. IEEE Transactions on Neural Networks, 4(3), May.
