Digital Arithmetic: CSE 237D: Spring 2008 Topic #8 Professor Ryan Kastner

Data Representation
Floating point representation
  Large dynamic range and high precision
  Costly
Fixed point representation
  Requires fewer resources
  Comparable performance
Bitwidth analysis for trading off estimation accuracy and the number of fixed-point bits
  8 bits is sufficient
Moorea Modem Receiver
Specification
Generalized multiple hypothesis test (GMHT)
[Figure: receiver block diagram in which a bank of Matching Pursuit Cores feeds an arg min over i. Note: 112 samples/symbol + 112 samples for channel clearing.]
Walsh/m-Sequence Waveforms
Chip rate is 5 kcps, giving approx. 5 kHz bandwidth. Uses a 25 kHz carrier.
Use a 7 chip m-sequence c per Walsh symbol, 8 bits per Walsh symbol b_i. The composite symbol duration is thus T = 11.2 msec (longer than the maximum multipath spread).
Symbol rate is 266 bps, or 133 bps when an 11.2 msec time guard band is used for channel clearing.
[Figure: Transmitted Signal. Example sequences of +1/-1 chips over a symbol of about 11 msec.]
[Figure: Walsh/m-sequence Signal Parameters. +1/-1 chip patterns for the 8 Walsh symbols.]

Specification
[Figure: the receiver block diagram again (Matching Pursuit Cores feeding an arg min over i; 112 samples/symbol + 112 samples for channel clearing), with the Channel Estimation stage labeled.]
Matching Pursuits Core
Goal: Map matching pursuits to reconfigurable device

Parameterizable number of samples, data representation
Tradeoffs - Provides designs with various area, latency, energy,
Matching Pursuits Algorithm

MP(r, S, A, a)
  for i = 1, 2, ..., Ns                        // compute matched filter (MF) outputs
    V_i^0 <- S_i^T r
    f_i <- 0
    g_i <- 0
  end for
  q_0 <- 0
  for j = 1, 2, ..., Nf                        // do successive interference cancellation
    V^j <- V^(j-1) - f_(q_(j-1)) A_(q_(j-1))   // update MF outputs
    for k = 0, 1, ..., Ns - 1
      g_k^j <- v_k^j a_k
      Q_k^j <- (v_k^j)^* g_k^j
    end for
    q_j <- argmax over k, k != q_1, ..., q_(j-1), of {Q_k^j}
    f_(q_j) <- g_(q_j)^j
  end for
  return(f)

[Figure: datapath of the matching pursuits core (inputs S, A, a; signals V, g, Q; two control blocks), mapped with design tools onto a reconfigurable system built from CLBs, Block RAM, and multiplier IP cores.]
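The pseudocode above can be sketched in software as follows. This is only an illustrative sketch, not the hardware design: it assumes A = S^H S (the signature cross-correlation matrix) and a_k = 1/A_kk, which the slides do not state, and the function and argument names are hypothetical.

```python
def matching_pursuit(r, signatures, num_paths):
    """Greedy matching pursuit sketch.

    r          -- received vector (list of numbers)
    signatures -- list of Ns signature vectors S_i, each as long as r
    num_paths  -- Nf, the number of coefficients to estimate
    Assumes A = S^H S and a_k = 1 / A[k][k] (not given in the slides).
    """
    ns = len(signatures)

    def dot(u, v):
        return sum(x.conjugate() * y for x, y in zip(u, v))

    A = [[dot(signatures[i], signatures[k]) for k in range(ns)] for i in range(ns)]
    a = [1.0 / A[k][k].real for k in range(ns)]
    v = [dot(signatures[i], r) for i in range(ns)]   # matched filter outputs
    f = [0j] * ns
    chosen = []
    for _ in range(num_paths):
        if chosen:                                   # cancel the last selected path
            q = chosen[-1]
            v = [v[k] - f[q] * A[k][q] for k in range(ns)]
        g = [v[k] * a[k] for k in range(ns)]
        Q = [(v[k].conjugate() * g[k]).real for k in range(ns)]
        q = max((k for k in range(ns) if k not in chosen), key=lambda k: Q[k])
        f[q] = g[q]
        chosen.append(q)
    return f
```

With orthonormal signatures the cancellation step has nothing to remove, so the estimate simply picks the largest matched filter outputs, which makes a convenient sanity check.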
In Depth: Data Representation
History of Number Systems
Oldest Number System?
  Fingers, but only 10
  Toes, but only 20
  Base 10, digital
  Roman schools taught finger counting
    multiplication/division on hands/toes
  "Counting in binary is just like counting in decimal if you are all thumbs." ~ Glaser and Way
Sand Tables
  Stones in the sand
  Three grooves with up to ten stones per groove
  "Calculate" is said to be derived from the Latin word "calcis" because limestone was used in the first sand tables.
  "Base eight is just like base ten really, if you're missing two fingers." ~ Tom Lehrer
Key Idea: Formal Notation
  Notches on bones, 8500 BC, in Africa and Europe
  Count in multiples of some basic number
  Greeks and Romans extended this; fundamentally still the same
  Positional notation is key: the same symbol in different spots has a different meaning
  5 or 10 based on fingers
  Mayans used 360
  Babylonians 60
Numbers
Any number system requires:

A set of digits
A set of possible values for the digits
Rules for interpreting the digits and values onto a number
Example: Roman Numerals

Symbols used to represent a value
Roman Numerals
    1 = I
    5 = V
    10 = X
    50 = L
    100 = C
    500 = D
    1000 = M
  For example: 2008 = MMVIII
Unsigned Number Systems

Unsigned integer decimal systems
  Set of digits represented by a digit vector X = (X_{n-1}, X_{n-2}, ..., X_1, X_0)
  Set of values for the digits: S_i = {0, 1, 2, ..., 9}
  Rule for determining the number: $X = \sum_{i=0}^{n-1} X_i 10^i$
Unsigned binary systems
  Set of digits represented by a digit vector X = (X_{n-1}, X_{n-2}, ..., X_1, X_0)
  Set of values for the digits: S_i = {0, 1}
  Rule for determining the number: $X = \sum_{i=0}^{n-1} X_i 2^i$
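As a small illustration of the positional rule, the sketch below evaluates a digit vector in any radix. The function name and the most-significant-digit-first ordering are choices made here for the example, not something the slides prescribe.

```python
def digits_to_value(digits, radix):
    """Evaluate X = sum(X_i * radix**i) for a digit vector given MSB first."""
    value = 0
    for d in digits:
        if not 0 <= d < radix:
            raise ValueError(f"digit {d} is not in the digit set for radix {radix}")
        value = value * radix + d   # Horner's rule: shift left one position, add digit
    return value


print(digits_to_value([2, 0, 0, 8], 10))  # decimal digit vector
print(digits_to_value([1, 1, 0, 1], 2))   # binary digit vector
```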
Other Useful Encodings
Some 4-bit number representation formats
[Figure: example 4-bit formats, including a floating-point format with exponent in {-2, -1, 0, 1} and significand in {0, 1, 2, 3}, and a base-2 logarithm format. Source: Parhami]
Encoding Numbers in 4 Bits
[Figure: values representable by each number format, plotted on a number line from -16 to 16. Formats shown:
  Unsigned integers
  Signed-magnitude
  3 + 1 fixed-point, xxx.x
  Signed fraction, ±.xxx
  2's-compl. fraction, x.xxx
  2 + 2 floating-point, s x 2^e, e in [-2, 1], s in [0, 3]
  2 + 2 logarithmic (log = xx.xx)
Source: Parhami]
Sign and Magnitude Representation
[Figure: number wheel relating 4-bit patterns (representation) to signed values (signed magnitude): 0000 to 0111 represent +0 to +7, and 1000 to 1111 represent -0 to -7. Arrows mark the increment and decrement directions. Source: Parhami]
Sign and Magnitude Adder
[Figure: signed-magnitude adder/subtractor. Sign x, Sign y, and Add/Sub feed a control block; a selective complement stage conditions x before the adder (with c_in and c_out), and a second selective complement stage after the adder produces s and Sign s. Source: Parhami]
Biased Representations
[Figure: number wheel relating 4-bit patterns (representation) to signed values (biased by 8): 0000 represents -8, 1000 represents 0, and 1111 represents +7, so incrementing the bit pattern always increments the value. Source: Parhami]
Arithmetic with Biased Numbers

Addition/subtraction of biased numbers
  x + y + bias = (x + bias) + (y + bias) - bias
  x - y + bias = (x + bias) - (y + bias) + bias
A power-of-2 (or 2^a - 1) bias simplifies addition/subtraction
Comparison of biased numbers:
Compare like ordinary unsigned numbers
find true difference by ordinary subtraction
We seldom perform arbitrary arithmetic on biased numbers
Main application: Exponent field of floating-point numbers
Source: Parhami
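The two identities for biased addition and subtraction can be checked with a short sketch. It assumes the 4-bit, bias-8 code from the number wheel; the helper names are hypothetical.

```python
BIAS = 8  # assumed: the 4-bit, bias-8 code shown on the number wheel


def encode(x):
    """Signed value -> biased code."""
    return x + BIAS


def decode(code):
    """Biased code -> signed value."""
    return code - BIAS


def add_biased(cx, cy):
    """(x + bias) + (y + bias) - bias = (x + y) + bias"""
    return cx + cy - BIAS


def sub_biased(cx, cy):
    """(x + bias) - (y + bias) + bias = (x - y) + bias"""
    return cx - cy + BIAS


print(decode(add_biased(encode(3), encode(-5))))
print(encode(-3) < encode(2))  # ordering matches an ordinary unsigned comparison
```

Because encoding only adds a constant, comparing two codes as unsigned integers gives the same answer as comparing the signed values, which is why this format suits floating-point exponent fields.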
Ones Complement Number Representation
[Figure: number wheel relating unsigned representations to signed values (1's complement): 0000 to 0111 represent +0 to +7, and 1000 to 1111 represent -7 to -0.]
Ones complement = digit complement (diminished radix complement) system for r = 2
  M = 2^k - ulp
  (2^k - ulp) - x = x^compl
Range of representable numbers with k whole bits: from -2^(k-1) + ulp to 2^(k-1) - ulp
Twos Complement Number Representation
[Figure: number wheel relating unsigned representations to signed values (2's complement): 0000 to 0111 represent +0 to +7, and 1000 to 1111 represent -8 to -1.]
Twos complement = radix complement system for r = 2
  M = 2^k
  2^k - x = [(2^k - ulp) - x] + ulp = x^compl + ulp
Range of representable numbers with k whole bits: from -2^(k-1) to 2^(k-1) - ulp
Source: Parhami
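Twos Complement Adder/Subtractor
[Figure: x goes directly to the adder; a mux performs controlled complementation and passes either y or its complement (this mux can be replaced with k XOR gates). The add/sub control (0 for addition, 1 for subtraction) selects the mux input and drives c_in. The adder produces s = x ± y and c_out.]
Source: Parhami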
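A bit-level sketch of this adder/subtractor, assuming k-bit operands held in Python integers (the names `add_sub` and `to_signed` are hypothetical): the add/sub control both XORs every bit of y and supplies the carry-in, which together form the two's complement of y.

```python
def add_sub(x, y, sub, k=4):
    """Return (sum_bits, carry_out) of x + y or x - y on k-bit patterns."""
    mask = (1 << k) - 1
    y_sel = (y ^ (mask if sub else 0)) & mask   # k XOR gates: y or its complement
    total = (x & mask) + y_sel + (1 if sub else 0)   # add/sub also drives c_in
    return total & mask, total >> k


def to_signed(bits, k=4):
    """Interpret a k-bit pattern as a two's complement value."""
    return bits - (1 << k) if bits & (1 << (k - 1)) else bits


s, cout = add_sub(3, 5, sub=True)
print(format(s, "04b"), to_signed(s))
```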
Sign and Magnitude vs Twos Complement
[Figure: the signed-magnitude adder/subtractor (control block plus selective complement stages before and after the adder) shown next to the twos-complement adder/subtractor (a mux or XOR stage on y, with add/sub driving c_in).]
Signed-magnitude adder/subtractor is significantly more complex than a simple adder
Twos-complement adder/subtractor needs very little hardware other than a simple adder
Source: Parhami
Fixed Point Representations
Allows us to use rational numbers: a/b
Numbers represented in the form:
  X = X_{a-1} X_{a-2} ... X_1 X_0 . X_{-1} X_{-2} ... X_{-b}
Unsigned mappings:
  $X = \sum_{i=-b}^{a-1} X_i 2^i$
  $X = \frac{1}{2^b} \sum_{i=0}^{n-1} X_i 2^i$
Twos complement mapping:
  $X = \frac{1}{2^b} \left[ -2^{n-1} X_{n-1} + \sum_{i=0}^{n-2} X_i 2^i \right]$
Fixed Point Properties
Resolution: Smallest non-zero magnitude
  Directly related to the number of fractional bits (b)
  Unsigned binary fixed point: resolution = 1/2^b
Range: Difference between most positive and most negative number
  Unsigned binary fixed point: range = 2^a - 2^(-b)
  Largely dependent on number of integer bits
Accuracy: Magnitude of the max difference between a real value and its representation
  Unsigned binary fixed point: accuracy = 1/2^(b+1)
  Accuracy(x) = resolution(x)/2
  With one fractional bit, the worst possible number is 1/4, since it is 1/4 from both 0 and 1/2, which are representable with 1 fractional bit
Example
Denote unsigned fixed point systems as U(a,b)
Given fixed point number system U(6,2):
  What number does 8A_16 represent?
  What is the range of U(6,2)?
  What is the resolution?
  What is the accuracy?
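One way to work the example is to apply the unsigned mapping and the property formulas directly. The sketch below does that; the function names are hypothetical, and the properties follow the unsigned definitions of range, resolution, and accuracy given on the previous slide.

```python
def unsigned_fixed_value(bits, a, b):
    """Value of an (a + b)-bit pattern in U(a, b): the integer scaled by 2^-b."""
    if not 0 <= bits < 1 << (a + b):
        raise ValueError("bit pattern does not fit in U(a, b)")
    return bits / 2**b


def unsigned_fixed_properties(a, b):
    return {
        "range": 2**a - 2**-b,     # largest representable value (smallest is 0)
        "resolution": 2**-b,       # smallest non-zero magnitude
        "accuracy": 2**-(b + 1),   # half the resolution
    }


print(unsigned_fixed_value(0x8A, 6, 2))   # 0x8A = 100010.10 in binary
print(unsigned_fixed_properties(6, 2))
```

Under these definitions 8A_16 represents 34.5, the range is 63.75, the resolution is 0.25, and the accuracy is 0.125.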
Rules of Fixed Point Arithmetic

Unsigned Wordlength U(a,b): a + b bits
Signed Wordlength S(a,b): a + b + 1 bits
Unsigned Range U(a,b): 0 <= x <= 2^a - 2^(-b)
Signed Range S(a,b): -2^a <= x <= 2^a - 2^(-b)
Addition: Z(a+1,b) = X(a1,b1) + Y(a2,b2)
  X and Y must be scaled, i.e. a1 = a2 = a and b1 = b2 = b
Unsigned Multiplication:
  U(a1,b1) x U(a2,b2) = U(a1 + a2, b1 + b2)
Signed Multiplication:
  S(a1,b1) x S(a2,b2) = S(a1 + a2 + 1, b1 + b2)
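The unsigned rules can be illustrated by tracking the raw integer and its (a, b) format together. This is a sketch with hypothetical helper names; it covers only the unsigned addition and multiplication rules.

```python
from fractions import Fraction


def u_value(raw, b):
    """Exact value of a raw unsigned fixed-point integer with b fractional bits."""
    return Fraction(raw, 2**b)


def u_add(x_raw, fmt_x, y_raw, fmt_y):
    """U(a,b) + U(a,b) -> U(a+1,b); operands must already share one format."""
    if fmt_x != fmt_y:
        raise ValueError("operands must be scaled to the same (a, b) first")
    a, b = fmt_x
    return x_raw + y_raw, (a + 1, b)


def u_multiply(x_raw, fmt_x, y_raw, fmt_y):
    """U(a1,b1) x U(a2,b2) -> U(a1+a2, b1+b2)."""
    (a1, b1), (a2, b2) = fmt_x, fmt_y
    return x_raw * y_raw, (a1 + a2, b1 + b2)


raw, fmt = u_multiply(0x8A, (6, 2), 0b0101, (2, 2))   # 34.5 x 1.25
print(raw, fmt, float(u_value(raw, fmt[1])))
```

The product needs no rescaling because the integer multiply already places the binary point b1 + b2 bits from the right.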
In Depth: Arithmetic Operations
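1 Bit Addition
Half Adder (HA)
  (2 : 2) counter: inputs A and B, outputs S and Cout
Full Adder (FA)
  (3 : 2) counter: inputs A, B, and Cin, outputs S and Cout
  Can be built from two HAs
[Figure: block symbols for the HA and the FA, and the FA built from two HAs.]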
Half Adder Implementations
[Figure: three half-adder circuits with inputs x, y and outputs c, s.
  (a) AND/XOR half-adder.
  (b) NOR-gate half-adder.
  (c) NAND-gate half-adder with complemented carry.
Source: Parhami]
Full Adder Implementations
[Figure: three full-adder circuits with inputs x, y, c_in and outputs c_out, s.
  (a) Built of half-adders.
  (b) Built as an AND-OR circuit.
  (c) Suitable for CMOS realization.
Source: Parhami]
Full Adder Implementations
[Figure: three full-adder circuits with inputs x, y, c_in and outputs c_out, s.
  (a) FA built of two HAs
  (b) CMOS mux-based FA
  (c) Two-level AND-OR FA
Source: Parhami]
Bit Serial Addition
Perform addition one bit at a time
  X_i + Y_i + C_(i-1), where C_(i-1) is the carry out of bits 0 through i-1
Result stored in a register that is right shifted
Slow but small area
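Ripple Carry Adder
n-bit Ripple Carry Adder
  parallel adder
  Area, delay?
[Figure: a chain of n full adders. The FA for bit i takes A_i, B_i, and the carry from bit i-1 and produces S_i; Cin enters at bit 0 and Cout leaves bit n-1. The same structure is also drawn as a single block: an n-bit two-operand adder with Cin and Cout.]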
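To make the structure concrete, here is a small gate-level sketch: a full adder built from two half adders, chained into a ripple carry adder. Bits are given least significant first; the function names are hypothetical.

```python
def half_adder(a, b):
    """(2:2) counter: returns (sum, carry)."""
    return a ^ b, a & b


def full_adder(a, b, cin):
    """(3:2) counter built from two half adders and an OR gate."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2


def ripple_carry_add(a_bits, b_bits, cin=0):
    """Add two equal-length bit lists (LSB first); returns (sum_bits, cout)."""
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same width")
    carry = cin
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)   # each stage waits on the previous carry
        sum_bits.append(s)
    return sum_bits, carry


print(ripple_carry_add([1, 0, 1, 1], [1, 1, 0, 0]))  # 13 + 3, LSB first
```

The loop-carried `carry` variable is the software image of the critical path: area grows linearly with n, and so does the worst-case delay.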
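Another View of Ripple Carry Adder
[Figure: the ripple carry adder redrawn as a carry network: per-bit generate and propagate signals G3, P3, ..., G0, P0 and the carry-in C0 feed a chain that produces the carries.]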
Faster Addition
We need to break the carry chain
The carry recurrence: c_(i+1) = g_i + p_i c_i
Observation: Carry only propagates in certain situations
[Figure: carry network built from g_(k-1), p_(k-1), ..., g_0, p_0 producing c_k, c_(k-1), ..., c_1 from c_0.]

Bit positions   15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
                 1  0  1  1  0  1  1  0  0  1  1  0  1  1  1  0
cout             0  1  0  1  1  0  0  1  1  1  0  0  0  0  1  1   cin
Carry chains and their lengths: 4, 6, 3, 2 (from left to right)
Manchester Adder
[Figure: for each bit i, a Kill, Generate, Propagate (KGP) block computes K_i, G_i, P_i from A_i and B_i and drives a Switched Carry Chain (SCC) cell that takes C_i and produces C_(i+1). The SCC cells for bits 0 to n-1 are chained from Cin to Cout.]
Carry Look Ahead
G = A and B
P = A xor B

  A  B  C-out
  0  0  0      kill
  0  1  C-in   propagate
  1  0  C-in   propagate
  1  1  1      generate

C0 = Cin
C1 = G0 + C0 P0
C2 = G1 + G0 P1 + C0 P0 P1
C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
C4 = . . .
[Figure: per-bit blocks computing G, P, and S from A0 B0 through A3 B3, with each carry produced directly by the expressions above.]
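The carry expressions above follow one pattern, which the sketch below evaluates for any width: every carry is computed directly from the g and p signals and c0, never from another carry. The function name and LSB-first bit lists are assumptions of this example.

```python
def cla_add(a_bits, b_bits, c0=0):
    """Carry lookahead addition on LSB-first bit lists; returns (sum_bits, cout)."""
    n = len(a_bits)
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate
    carries = [c0]
    for i in range(n):
        # c_(i+1) = g_i + g_(i-1) p_i + ... + g_0 p_1..p_i + c_0 p_0..p_i
        c = c0
        for m in range(i + 1):
            c &= p[m]
        for j in range(i + 1):
            term = g[j]
            for m in range(j + 1, i + 1):
                term &= p[m]
            c |= term
        carries.append(c)
    sums = [p[i] ^ carries[i] for i in range(n)]
    return sums, carries[n]


print(cla_add([1, 1, 0, 1], [1, 0, 1, 0]))  # 11 + 5, LSB first
```

In hardware each product term is a single AND gate and each carry a single OR gate, so the inner loops here stand in for gate fan-in, not for sequential delay.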
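Plumbing as Carry Lookahead Analogy
[Figure: pipes and valves illustrating c1 (from g0, p0, c0), c2 (from g1, g0, p1, p0, c0), and c4 (from g3, g2, g1, g0, p3, p2, p1, p0, c0).]
2 Bit Carry Lookahead Adder
[Figure: a 2-bit CLA block with inputs G^0_0, P^0_0, G^0_1, P^0_1, and C0, producing C1, C2, and the group signals G^1 and P^1.]
  P^1 = P^0_0 P^0_1
  G^1 = G^0_0 P^0_1 + G^0_1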
4 Bit Carry Look Ahead
Complexity reduced by deriving the carry-out indirectly, but increases critical path
Full carry lookahead is quite practical for a 4-bit adder
  c1 = g0 + c0 p0
  c2 = g1 + g0 p1 + c0 p0 p1
  c3 = g2 + g1 p2 + g0 p1 p2 + c0 p0 p1 p2
  c4 = g3 + g2 p3 + g1 p2 p3 + g0 p1 p2 p3 + c0 p0 p1 p2 p3
[Figure: gate network computing c1 to c4 from g0..g3, p0..p3, and c0.]
Source: Parhami
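Carry Look Ahead, multiple levels
[Figure: a 16-bit adder (bit positions 0 to 15) built from four 4-bit blocks. Each block takes its A and B bits and a block carry-in and produces a group generate and propagate (G0 P0 through G3 P3); a second-level lookahead unit takes C0 and the group signals and produces C1, C2, C3 (shown as C4 and C16 at the block boundaries).]
Cascaded Carry Look-ahead (16-bit): Abstraction
[Figure: four 4-bit adders, each supplying G and P to a shared carry lookahead (CLA) unit that receives C0.]
  C1 = G0 + C0 P0
  C2 = G1 + G0 P1 + C0 P0 P1
  C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
  C4 = . . .
Carry Lookahead Generator Plumbing Analogy
[Figure: pipes and valves illustrating the group signals P0 (p0, p1, p2, p3 in series) and G0 (g3, or g2 through p3, or g1 through p2 p3, or g0 through p1 p2 p3).]
4 Bit Hierarchical CLA
[Figure: two 2-bit CLA blocks (bits 3-2 and bits 1-0) feeding a 2-bit carry lookahead generator (CLG) together with C0.]
  P^2 = P^1_0 P^1_1
  G^2 = G^1_0 P^1_1 + G^1_1
  C4 = C0 P^1_0 P^1_1 + G^1_0 P^1_1 + G^1_1
8 Bit Hierarchical CLA
[Figure: four 2-bit CLA blocks (bits 7 down to 0) feed two 2-bit CLGs, whose outputs G^2 and P^2 feed a third 2-bit CLG together with C0 to produce C8.]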
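Design Trick: Guess (or Precompute)
Two n-bit adders in series: CP(2n) = 2*CP(n)
Upper half precomputed for both carry-in values and selected by a mux: CP(2n) = CP(n) + CP(mux)
[Figure: one n-bit adder for the lower half; two n-bit adders for the upper half, one per assumed carry-in, with a mux driven by the lower half's Cout.]
Carry-select adder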
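A word-level sketch of the carry-select idea, assuming a 2n-bit addition split into two n-bit halves (the function name is hypothetical): both possible upper sums are formed without waiting for the lower half, and the lower carry-out only steers a mux.

```python
def carry_select_add(a, b, n):
    """Add two 2n-bit unsigned integers using the carry-select structure."""
    mask = (1 << n) - 1
    low = (a & mask) + (b & mask)        # lower n-bit adder
    carry_low = low >> n
    high_if_0 = (a >> n) + (b >> n)      # upper adder, guessing carry-in = 0
    high_if_1 = high_if_0 + 1            # upper adder, guessing carry-in = 1
    high = high_if_1 if carry_low else high_if_0   # the mux
    return (high << n) | (low & mask)


print(carry_select_add(0b10111001, 0b01101110, 4))
```

The cost is the duplicated upper adder plus the mux, traded for a critical path of roughly one n-bit adder and one mux.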
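Pipelined Ripple Carry Adder
[Figure: the ripple carry adder with flip-flop (FF) stages inserted between the full adders. Inputs A_i, B_i are delayed by i stages before reaching FA_i, and sum outputs S_i are delayed by the remaining stages (n-1 FFs for S0, n-2 for S1, n-3 for S2, ...) so that all sum bits emerge together. Cin enters at bit 0 and Cout leaves bit n-1.]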
Multiple Operand Addition
Many applications require summation of many operands
  What is the best way to compute this?
  Inner Product
  Multiplication
[Figure: dot notation for two examples. Multiplication: a times x = (x3 x2 x1 x0) forms the partial products x0 a 2^0, x1 a 2^1, x2 a 2^2, x3 a 2^3, which are summed to give p. Multioperand addition: operands p(0) through p(6) are summed to give s.]
Terminology
Serial Implementation
[Figure: operand O_i[n] and the running sum S_i[n + log i] feed a Two Operand Carry Propagate Adder; Register S holds the result S_(i+1)[n + log(i+1)].]
T_serial-multi-add = O(m log(n + log m))
                   = O(m log n + m log log m)
Therefore, addition time grows superlinearly with m when n is fixed and logarithmically with n for a given m
Parallel Implementation
[Figure: the m operands O_1[n], O_2[n], ..., O_(m-1)[n], O_m[n] are added pairwise by a tree of CPAs of depth O(log m), producing S[n + log m].]
Can we do this faster?
T_tree-fast-multi-add = O(log n + log(n + 1) + . . . + log(n + log2 m - 1))
                      = O(log m log n + log m log log m)
Carry Save Adder (CSA)
[Figure: an n-bit Carry Save Adder takes three n-bit operands O1[n], O2[n], O3[n] and produces C[n] and S[n]. Internally it is n independent full adders, one per bit position, with no carry connection between them.]
Carry Save Adders
[Figure: a row of full adders connected from cin to cout forms a carry-propagate adder; cutting the carry connections yields a carry-save adder (CSA), also called a (3; 2)-counter or 3-to-2 reduction circuit.]
Carry propagate adder (CPA) and carry save adder (CSA) functions in dot notation.
Specifying full- and half-adder blocks, with their inputs and outputs, in dot notation.
Source: Parhami
Serial CSA Implementation
[Figure: operand O_i[n] and the running pair C_i[n + log i], S_i[n + log i] feed a Carry Save Adder; Register C holds C_(i+1)[n + log(i+1)] and Register S holds S_(i+1)[n + log(i+1)].]
T_serial-csa-multi-add = O(m)
In the end there are two operands (C, S)
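The serial scheme can be sketched at word level as below, assuming non-negative integer operands (the function names are hypothetical). Each step is a 3-to-2 reduction with no carry propagation; only the final step is an ordinary carry-propagate addition of the two remaining operands.

```python
def carry_save_add(x, y, z):
    """(3; 2) reduction: returns (S, C) with S + C == x + y + z."""
    s = x ^ y ^ z                                # per-bit sum, no carries rippled
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit majority, weighted by 2
    return s, c


def multi_operand_sum(operands):
    """Accumulate operands in carry-save form, then do one final CPA step."""
    s, c = 0, 0                                  # Register S and Register C
    for op in operands:
        s, c = carry_save_add(s, c, op)
    return s + c                                 # final 2:1 reduction


print(carry_save_add(5, 9, 14))
print(multi_operand_sum([5, 9, 14, 3, 7, 12, 1]))
```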
Final Reduction (2:1)
[Figure: a Carry Propagate Adder combines the carry-save pair bit by bit (S[i] with C[i-1]; a half adder handles the lowest position that has only two inputs) and produces T[1], T[2], T[3], ..., T[n+1], T[n+2], with Cout at the top.]
[Figure: adding six operands O1[n] to O6[n] with a chain of CSAs:
  O1 + O2 + O3          -> S1[n:1],   C1[n+1:2]
  O4 + S1 + C1          -> S2[n+1:1], C2[n+2:3]
  O5 + S2 + C2          -> S3[n+2:1], C3[n+2:2]
  O6 is added by one more CSA stage in the same way.]
Carry Save Arithmetic
[Figure: a tree of CSAs reduces the operands to a sum S and a carry C, which a final CLA adds.]
Tree height = log1.5(N/2)
Delay = 3 + log2(M + 3)
  3 = height of CSA tree
  M = bitwidth of operands
Carry Save Arithmetic
[Figure: the same sum computed with a chain of five RCAs of widths (M+1), (M+2), (M+3), (M+4), (M+5).]
Using Ripple carry adders (RCAs):
  Delay = (M+5) + 4
Delay thru CSA network = 3 + log1.5(M + 3)
Example Reduction by a CSA Tree
Representing a seven-operand addition in tabular form:

  Bit position   8  7  6  5  4  3  2  1  0
                          7  7  7  7  7  7    6 x 2 = 12 FAs
                       2  5  5  5  5  5  3    6 FAs
                       3  4  4  4  4  4  1    6 FAs
                    1  2  3  3  3  3  2  1    4 FAs + 1 HA
                    2  2  2  2  2  1  2  1    7-bit adder
                 ---- Carry-propagate adder ----
                 1  1  1  1  1  1  1  1  1

[Figure: Addition of seven 6-bit numbers in dot notation, with the same stages: 12 FAs, 6 FAs, 6 FAs, 4 FAs + 1 HA, 7-bit adder.]
Total cost = 7-bit adder + 28 FAs + 1 HA
A full-adder compacts 3 dots into 2 (compression ratio of 1.5)
A half-adder rearranges 2 dots (no compression, but still useful)
Source: Parhami
Wallace and Dadda Reduction Trees
Wallace tree: Reduce the number of operands at the earliest possible opportunity
[Figure: Addition of seven 6-bit numbers in dot notation. Stages: 12 FAs, 6 FAs, 6 FAs, 4 FAs + 1 HA, 7-bit adder.]
Dadda tree: Postpone the reduction to the extent possible without causing added delay
[Figure: Adding seven 6-bit numbers using Dadda's strategy. Stages: 6 FAs, 11 FAs, 7 FAs, 4 FAs + 1 HA, 7-bit adder.]

  h      2   3   4   5   6
  n(h)   4   6   9  13  19

Source: Parhami
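The n(h) table gives the largest number of operands that a tree of h carry-save levels can reduce to two. The sketch below regenerates it from the recurrence n(h) = floor(1.5 n(h-1)) with n(0) = 2, which is the standard recurrence and is consistent with the table, though the slide itself does not spell it out. The function names are hypothetical.

```python
def max_operands(h):
    """n(h): most operands reducible to two by h levels of (3; 2) counters."""
    n = 2
    for _ in range(h):
        n = (3 * n) // 2      # each level turns every 3 rows into 2
    return n


def levels_needed(num_operands):
    """Smallest h with n(h) >= num_operands."""
    h = 0
    while max_operands(h) < num_operands:
        h += 1
    return h


print([max_operands(h) for h in range(2, 7)])
print(levels_needed(7))   # the seven-operand example needs this many CSA levels
```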
Generalized Parallel Counters
Multicolumn reduction: (5, 5; 4)-counter
Unequal columns: (2, 3; 3)-counter
Gen. parallel counter = Parallel compressor
[Figure: Dot notation for a (5, 5; 4)-counter and the use of such counters for reducing five numbers to two numbers.]
Source: Parhami
Compressors
Compressors allow for carry-ins and carry-outs
[Figure: [4:2] compressors at bit i and bit i-1. Each takes O1..O4 of its bit position plus Cin, is built from two full adders, and produces C, S, and Cout.]
[4 : 2] Compressor Adder
[Figure: an n-bit [4:2] adder takes four n-bit operands O1[n] to O4[n] and produces C[n] and S[n], using one [4:2] compressor per bit position (bits 1, 2, ..., n).]
Higher Order Compressors
[Figure: [5:2] compressors at bit i and bit i-1. Each takes O1..O5 of its bit position, is built from three full adders, and produces C and S.]

Specification
[Figure: the receiver block diagram again (Matching Pursuit Cores feeding an arg min over i; 112 samples/symbol + 112 samples for channel clearing), with Linear System Optimizations labeled.]
Linear System Optimization
Linear systems are ubiquitous in signal processing applications

$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix} =
\begin{bmatrix}
\cos(0) & \cos(0) & \cos(0) & \cos(0) \\
\cos(\pi/8) & \cos(3\pi/8) & \cos(5\pi/8) & \cos(7\pi/8) \\
\cos(\pi/4) & \cos(3\pi/4) & \cos(5\pi/4) & \cos(7\pi/4) \\
\cos(\ldots/8) & \cos(\ldots/8) & \cos(\ldots/8) & \cos(\ldots/8)
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
$$

We have developed many methods for optimization to hardware, software, FPGA [ASAP04, ASPDAC05, DATE06, ICCD06, Journal of VLSI Signal Processing07]
1D linear systems on previous slide, aka FIR filters
FIR Filter Implementations: Multiply Accumulate Method
Convolution of the latest L input samples. L is the number of coefficients h(k) of the filter, and x(n) represents the input time series.
  $y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k]$
[Figure: input x[n] enters a tapped delay line of z^-1 elements; the taps are multiplied by h_(L-1), h_(L-2), h_(L-3), ..., h_1, h_0 and accumulated to form y[n].]
Disadvantages
  Large area on FPGA due to multipliers, and the full flexibility of general purpose multipliers is not required
  Limited number of embedded resources such as MAC engines, multipliers, etc. in FPGAs

Distributed Arithmetic
Summation of inner product:
  $Y = \sum_{k=1}^{K} A_k X_k$
  A_k = constant coefficients
  X_k = input data
We can write each input data word in twos complement:
  $X_k = -X_{k0} + \sum_{b=1}^{B-1} X_{kb} 2^{-b}$
Substituting this into the above yields:
  $Y = \sum_{k=1}^{K} A_k \left[ -X_{k0} + \sum_{b=1}^{B-1} X_{kb} 2^{-b} \right]$
Exchange order of the summations:
  $Y = \sum_{b=1}^{B-1} \left[ \sum_{k=1}^{K} A_k X_{kb} \right] 2^{-b} + \sum_{k=1}^{K} A_k (-X_{k0})$

From previous slide:
  $Y = \sum_{b=1}^{B-1} \left[ \sum_{k=1}^{K} A_k X_{kb} \right] 2^{-b} + \sum_{k=1}^{K} A_k (-X_{k0})$
How do we compute the bracketed term?
  Multiply a particular bit b of each of the inputs by the binary constants A_1, A_2, ..., A_K
Questions: Assume we are looking at b=1 (LSB of inputs), but this generalizes to any b
  What if every bit X_k1 is 0, i.e. X_k1 = [000000]?
  What if X_11 = 1 and the remaining are 0? X_k1 = [000001]?
  What if X_11 = 1, X_21 = 1 and the rest are 0? X_k1 = [0000011]?

Looking at summations in a different way
  $Y = \sum_{k=1}^{K} A_k \left[ -X_{k0} + \sum_{b=1}^{B-1} X_{kb} 2^{-b} \right]$
expands word by word to
  $Y = A_1 \left( -X_{10} + X_{11} 2^{-1} + X_{12} 2^{-2} + \cdots + X_{1(B-1)} 2^{-(B-1)} \right)$
  $\quad + A_2 \left( -X_{20} + X_{21} 2^{-1} + X_{22} 2^{-2} + \cdots + X_{2(B-1)} 2^{-(B-1)} \right)$
  $\quad + \cdots$
  $\quad + A_K \left( -X_{K0} + X_{K1} 2^{-1} + X_{K2} 2^{-2} + \cdots + X_{K(B-1)} 2^{-(B-1)} \right)$
while
  $Y = \sum_{b=1}^{B-1} \left[ \sum_{k=1}^{K} A_k X_{kb} \right] 2^{-b} + \sum_{k=1}^{K} A_k (-X_{k0})$
expands bit by bit to
  $Y = [A_1 X_{11} + A_2 X_{21} + A_3 X_{31} + \cdots + A_K X_{K1}]\, 2^{-1}$
  $\quad + [A_1 X_{12} + A_2 X_{22} + A_3 X_{32} + \cdots + A_K X_{K2}]\, 2^{-2}$
  $\quad + \cdots$
  $\quad + [A_1 X_{1(B-1)} + A_2 X_{2(B-1)} + A_3 X_{3(B-1)} + \cdots + A_K X_{K(B-1)}]\, 2^{-(B-1)}$
  $\quad + A_1(-X_{10}) + A_2(-X_{20}) + A_3(-X_{30}) + \cdots + A_K(-X_{K0})$

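[Figure: the bits X_1b, X_2b, ..., X_Kb form the address of a 2^K entry LUT; the LUT output goes to an adder (+) with a shifter (>>), forming a scaling accumulator.]

  Address    Value
  00...00    0
  00...01    A1
  00...10    A2
  00...11    A1 + A2
  ...        ...
  11...11    A1 + A2 + ... + AK

Precision of constant bits wide: usually equal to precision of input data B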

Advantages
  Replaces multiplication with LUT
  Coefficients stored in LUTs
Disadvantages
  Performance limited, as the next input sample is processed only after every bit of the current input sample is processed
  Increasing the number of bits to be processed has a significant effect on resource utilization
    Larger size scaling accumulator needed for higher number of bits
    Increases critical path delay

  Address    Data
  0000       0
  0001       C0
  0010       C1
  ...        ...
  1111       C0+C1+C2+C3

The performance can be improved by replication: process multiple bits at a time
  Significant effect on resource utilization
    More LUTs
    Larger size scaling accumulator
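A bit-serial distributed arithmetic inner product can be sketched as follows. To keep the example exact, it uses B-bit twos complement integers (bit b has weight 2^b, and the sign bit is subtracted) in place of the fractional 2^-b form in the derivation; that substitution and all names are assumptions of this sketch.

```python
def build_lut(coeffs):
    """2^K entry table: entry `addr` holds the sum of A_k for each set bit k."""
    K = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << K)]


def da_inner_product(coeffs, xs, B):
    """Compute sum(A_k * X_k) for B-bit twos complement integers X_k."""
    lut = build_lut(coeffs)
    acc = 0
    for b in range(B):
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k          # bit b of every input forms the address
        term = lut[addr] << b                    # scaling by the bit weight
        acc += -term if b == B - 1 else term     # the sign bit carries negative weight
    return acc


coeffs = [3, -5, 7, 2]
xs = [5, -3, -8, 7]
print(da_inner_product(coeffs, xs, 4), sum(a * x for a, x in zip(coeffs, xs)))
```

No multiplier appears anywhere: each cycle is one table lookup, one shift, and one add or subtract, which is the point of the method and also the source of its B-cycle latency.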

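Add and Shift Method
[Figure: FIR filter in which input x[n] is multiplied by the constants h_(L-1), h_(L-2), h_(L-3), ..., h_1, h_0 and the products are combined through a chain of z^-1 delays and adders to form y[n].]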
Idea: Constant Multiplication to Shift/Add
Multiplication is expensive in hardware
  Decompose constant multiplications into shifts and additions
  Signed digits can reduce the number of additions/subtractions
  13*X = (1101)_2 * X = X + X<<2 + X<<3
Canonical Signed Digits (CSD) (Knuth74)
  (57)_10 = (0111001)_2 = (1 0 0 -1 0 0 1)_CSD
Further reduction possible by common subexpression elimination
  Up to 50% reduction (R.Hartley TCS96)
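As an illustration of the signed-digit idea, the sketch below recodes a non-negative constant into canonical signed digits and uses the result to multiply with shifts, additions, and subtractions only. The function names are hypothetical.

```python
def to_csd(n):
    """CSD digits of a non-negative integer, LSB first, each in {-1, 0, 1}."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)    # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def csd_multiply(constant, x):
    """constant * x using one shift and one add/subtract per non-zero digit."""
    result = 0
    for shift, d in enumerate(to_csd(constant)):
        if d == 1:
            result += x << shift
        elif d == -1:
            result -= x << shift
    return result


print(to_csd(57)[::-1])        # MSB first: 64 - 8 + 1
print(csd_multiply(57, 11))
```

For 57 the plain binary form has four non-zero digits, while the CSD form has three, so one adder is saved.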
Introduction
Common subexpressions = common digit patterns
  F1 = 7*X = (0111)*X = X + X<<1 + X<<2
  F2 = 13*X = (1101)*X = X + X<<2 + X<<3
  Cost: 4+, 4<<
Common digit pattern 0101 => X + X<<2
  D1 = X + X<<2
  F1 = D1 + X<<1
  F2 = D1 + X<<3
  Cost: 3+, 3<<
Good for single variable: FIR filters (transposed form)
Multiple variable? (DFT, DCT etc.)
Linear Systems and polynomial transformation
H.264 Integer Transform:

  [Y0]   [ 1   1   1   1 ] [X0]
  [Y1] = [ 2   1  -1  -2 ] [X1]
  [Y2]   [ 1  -1  -1   1 ] [X2]
  [Y3]   [ 1  -2   2  -1 ] [X3]

Decomposing constant multiplications (12+, 4<<):
  Y0 = X0 + X1 + X2 + X3
  Y1 = X0<<1 + X1 - X2 - X3<<1
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1<<1 + X2<<1 - X3
Linear Systems and polynomial transformation
Polynomial Transformation of the H.264 Integer Transform, writing a left shift by one as multiplication by L (12+, 4<<):
  Y0 = X0 + X1 + X2 + X3
  Y1 = X0 L + X1 - X2 - X3 L
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1 L + X2 L - X3
H.264 Example
  Y0 = X0 + X1 + X2 + X3
  Y1 = X0 L + X1 - X2 - X3 L
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1 L + X2 L - X3
Select D0 = (X0 + X3)

H.264 Example
  Y0 = D0 + X1 + X2
  Y1 = X0 L + X1 - X2 - X3 L
  Y2 = D0 - X1 - X2
  Y3 = X0 - X1 L + X2 L - X3
Select D1 = (X1 - X2)

H.264 Example
  Y0 = D0 + X1 + X2
  Y1 = X0 L + D1 - X3 L
  Y2 = D0 - X1 - X2
  Y3 = X0 - D1 L - X3
Select D2 = (X1 + X2)

H.264 Example
  Y0 = D0 + D2
  Y1 = X0 L + D1 - X3 L
  Y2 = D0 - D2
  Y3 = X0 - D1 L - X3
Select D3 = (X0 - X3)
Final Implementation
Extracting 4 divisors: 8+, 2<<
  D0 = X0 + X3
  D1 = X1 - X2
  D2 = X1 + X2
  D3 = X0 - X3
  Y0 = D0 + D2
  Y1 = D1 + D3 L
  Y2 = D0 - D2
  Y3 = D3 - D1 L
Original: 12+, 4<<
Rectangle Covering: 10+, 3<<
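The final implementation can be checked against the original matrix with a few lines of code. This is a sketch with hypothetical function names; `<< 1` plays the role of the shift operator L.

```python
H264 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def transform_direct(x):
    """Reference: plain matrix-vector product."""
    return [sum(c * v for c, v in zip(row, x)) for row in H264]


def transform_shared(x):
    """Shared-divisor form: 8 additions/subtractions and 2 shifts."""
    x0, x1, x2, x3 = x
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return [d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)]


x = [5, 11, 8, 10]
print(transform_direct(x), transform_shared(x))
```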
FPGA FIR Filter Implementations:
  F1 = A + B + C + D
  F2 = A + B + C + E
[Figure: three pairs of adder trees for F1 and F2: Unoptimized Expression Trees; Extracting Common Expression (A + B + C); Extracting Common Expression (A + B).]
Optimization
Resource Utilization + Performance Results

Filter Implementation Using Add and Shift Method
  Filter (# taps)   Slices   LUTs    FFs     Performance (Msps)
  ?                 264      213     509     251
  10                474      406     916     222
  13                386      334     749     252
  20                856      705     1650    250
  28                1294     1145    2508    227
  41                2154     1719    4161    223
  61                3264     2591    6303    192
  119               6009     4821    11551   203
  151               7579     6098    14611   180

Filter Implementation Using Xilinx Coregen (PDA)
  Filter (# taps)   Slices   LUTs    FFs     Performance (Msps)
  ?                 524      774     1012    245
  10                781      1103    1480    222
  13                929      1311    1775    199
  20                1191     1631    2288    199
  28                1774     2544    3381    199
  41                2475     3642    4748    222
  61                3528     5335    6812    199
  119               6484     9754    12539   205
  151               8274     12525   15988   199
Experimental Results
DA vs. Add and Shift Method
[Figure: two charts comparing DA with the add and shift method.]
MAC vs. Add and Shift Method

  Filter      Add Shift Method     MAC filter
  (# taps)    Slices    Msps       Slices    Msps
  ?           264       296        219       262
  10          475       296        418       253
  13          387       296        462       253
  20          851       271        790       251
  28          1303      305        886       251
  41          2178      296        1660      243
  61          3284      247        1947      242
  119         6025      294        3581      241
  151         7623      294        7631      215
CSA CSE for Linear Systems
  Y1 = X1 + X1<<2 + X2 + X2<<1 + X2<<2
  Y2 = X1<<2 + X2<<2 + X2<<3
Extract D1 = X1 + X2 + X2<<1, computed by a CSA as a sum word D1S and a carry word D1C:
  Y1 = (D1S + D1C) + X1<<2 + X2<<2
  Y2 = (D1S + D1C)<<2
Algebraic methods
Greedy iterative algorithm
  Extracts the best 3-term divisor
  Rewrites the expressions containing it
  Terminates when there are no more common subexpressions
Example:
  F1 = a + b + c + d + e
  F2 = a + b + c + d + f
Extract D1 = a + b + c:
  F1 = D1S + D1C + d + e
  F2 = D1S + D1C + d + f
Extract D2 = D1S + D1C + d:
  F1 = D2S + D2C + e
  F2 = D2S + D2C + f
Experimental results
Comparing # of CSAs
Average 38.4% reduction
Experimental results
FPGA synthesis
Virtex II FPGAs
Synthesized designs and performed place & route
Avg 14.1% reduction in # Slices and avg 12.9% reduction in # LUTs
Avg 5.7% increase in the delay
Conclusions
Optimized acoustic modem by focusing on channel estimation and FIR filters
In depth study of parallelization, number representation, arithmetic, and linear system optimization
