Digital Arithmetic: CSE 237D: Spring 2008 Topic #8 Professor Ryan Kastner
Data Representation
[Figure: three Matching Pursuit cores in parallel feeding an arg min over i. Note: 112 samples/symbol + 112 samples for channel clearing.]
Walsh/m-Sequence
Waveforms
Chip rate is 5 kcps, giving approximately 5 kHz of bandwidth, on a 25 kHz carrier.
Each Walsh symbol b_i is 8 bits long and each bit is spread by a 7-chip m-sequence c, so a composite symbol has 7 x 8 = 56 chips and lasts T = 11.2 msec (longer than the maximum multipath spread).
This gives 266 bps, or 133 bps when an 11.2 msec time guard band is added for channel clearing.
Transmitted Signal
[Figure: transmitted waveform as a sequence of +1/-1 chips; one composite symbol spans about 11 msec.]
Walsh/m-sequence Signal Parameters
[Figure: +1/-1 chip sequence of the transmitted signal; 8 Walsh symbols.]
Channel Estimation
[Figure: parallel Matching Pursuit cores feeding an arg min over i. Note: 112 samples/symbol + 112 samples for channel clearing.]
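[Figure: tail of the Matching Pursuit pseudocode (lines 14 to 16, involving f_q and g_q and ending with "end for" and "return(f)") next to the FPGA resources it maps to: CLB, Block RAM, IP Core (Multiplier).]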
Design Tools
[Figure: Matching Pursuit datapath with signals S, A, a, Q, g, V, indices i, j, k, a multiplier (*), a subtractor (-), and control logic.]
Reconfigurable System
"Base eight is just like base ten really, if you're missing two
fingers." ~ Tom Lehrer
Base 5 or 10, based on fingers
Mayans used base 20, with 360 (18 x 20) as a place value in calendar counts
Babylonians used base 60
Numbers
1 = I
5 = V
10 = X
50 = L
100 = C
500 = D
1000 = M
Set
Set
Rules
Unsigned
binary systems
$$X = \sum_{i=0}^{n-1} X_i \cdot 10^i$$
Set
Set
Rules
n1
Exponent in {-2, -1, 0, 1}
Significand in {0, 1, 2, 3}
Base-2
logarithm
Source: Parhami
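[Figure: values representable in different number formats, plotted on a common axis (ticks at 10, 12, 14, 16). Number formats shown: unsigned integers; signed-magnitude; 3 + 1 fixed-point, xxx.x; signed fraction, ±.xxx; logarithmic (log x).]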
Source: Parhami
Bit pattern (representation)   Signed value (signed magnitude)
0000                            0
0001                           +1
0010                           +2
0011                           +3
0100                           +4
0101                           +5
0110                           +6
0111                           +7
1000                           -0
1001                           -1
1010                           -2
1011                           -3
1100                           -4
1101                           -5
1110                           -6
1111                           -7
Incrementing the bit pattern increments the value on the positive half and decrements it on the negative half.
Source: Parhami
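[Figure: signed-magnitude adder/subtractor. Sign x, Sign y, and Add/Sub feed a Control block that produces Compl x, Compl s, and Sign s. Operand x passes through a selective complement stage before the adder (with c_in and c_out), and the adder output passes through a second selective complement stage to give s.]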
Source: Parhami
Biased Representations
Bit pattern (representation)   Signed value (biased by 8)
0000                           -8
0001                           -7
0010                           -6
0011                           -5
0100                           -4
0101                           -3
0110                           -2
0111                           -1
1000                            0
1001                           +1
1010                           +2
1011                           +3
1100                           +4
1101                           +5
1110                           +6
1111                           +7
Incrementing the bit pattern always increments the value.
Source: Parhami
Source: Parhami
Unsigned representation   Signed value (1s complement)
0000                      +0
0001                      +1
0010                      +2
0011                      +3
0100                      +4
0101                      +5
0110                      +6
0111                      +7
1000                      -7
1001                      -6
1010                      -5
1011                      -4
1100                      -3
1101                      -2
1110                      -1
1111                      -0
Unsigned representation   Signed value (2s complement)
0000                      +0
0001                      +1
0010                      +2
0011                      +3
0100                      +4
0101                      +5
0110                      +6
0111                      +7
1000                      -8
1001                      -7
1010                      -6
1011                      -5
1100                      -4
1101                      -3
1110                      -2
1111                      -1
$M = 2^k$
$2^k - x = [(2^k - \mathrm{ulp}) - x] + \mathrm{ulp} = x^{\mathrm{compl}} + \mathrm{ulp}$
Range of representable numbers in 2s complement with k whole bits: from $-2^{k-1}$ to $2^{k-1} - \mathrm{ulp}$
Source: Parhami
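[Figure: two's-complement adder/subtractor. Controlled complementation: a mux selects y or its complement under the add/sub control (0 for addition, 1 for subtraction); this mux can be replaced with k XOR gates. The add/sub signal also drives c_in of the adder, which produces s = x ± y and c_out.]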
Source: Parhami
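The controlled-complementation adder/subtractor can be modeled in a few lines. This is a minimal sketch, assuming k-bit operands held in Python integers; the names `add_sub` and `to_signed` are illustrative and do not come from the slides.

```python
def add_sub(x: int, y: int, sub: int, k: int = 4) -> tuple[int, int]:
    """Return (s, c_out) for k-bit x + y (sub=0) or x - y (sub=1)."""
    mask = (1 << k) - 1
    # k XOR gates: pass y unchanged when sub=0, complement it when sub=1.
    y_sel = (y ^ (mask if sub else 0)) & mask
    # The add/sub control also feeds c_in, supplying the "+ ulp" of the complement.
    total = (x & mask) + y_sel + sub
    return total & mask, total >> k


def to_signed(bits: int, k: int = 4) -> int:
    """Interpret a k-bit pattern as a two's-complement value."""
    return bits - (1 << k) if bits & (1 << (k - 1)) else bits
```

For example, `to_signed(add_sub(3, 5, 1)[0])` evaluates 3 - 5 on four bits and yields -2.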
Signed-magnitude adder/subtractor is significantly more complex than a simple adder.
Twos-complement adder/subtractor needs very little hardware other than a simple adder.
[Figure: the signed-magnitude adder/subtractor (control block plus selective complement stages before and after the adder) shown next to the twos-complement adder/subtractor (controlled complementation of y, with add/sub driving c_in; 0 for addition, 1 for subtraction).]
Source: Parhami
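$$X = X_{a-1} X_{a-2} \cdots X_1 X_0 \,.\, X_{-1} X_{-2} \cdots X_{-b}$$
Unsigned:
$$X = \sum_{i=-b}^{a-1} 2^i X_i$$
Integer mappings (bits renumbered $X_{n-1} \ldots X_0$ with $n = a + b$).
Unsigned:
$$X = \frac{1}{2^b} \sum_{i=0}^{n-1} 2^i X_i$$
Two's complement:
$$X = \frac{1}{2^b} \left[ -2^{n-1} X_{n-1} + \sum_{i=0}^{n-2} 2^i X_i \right]$$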
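The fixed-point mappings can be checked numerically. The following is a small sketch, assuming the bits are given as a string with the most significant bit first; `fixed_point_value` is a hypothetical helper name.

```python
def fixed_point_value(bits: str, b: int, signed: bool = False) -> float:
    """Value of an n-bit string (MSB first) with b fractional bits."""
    n = len(bits)
    x = [int(ch) for ch in reversed(bits)]      # x[i] has weight 2**i
    total = sum(x[i] << i for i in range(n - 1))
    msb = x[n - 1] << (n - 1)
    # Two's complement gives the top bit a negative weight.
    total += -msb if signed else msb
    return total / (1 << b)
```

With a 3 + 1 format (`b = 1`), the pattern `1101` reads as 6.5 unsigned and as -1.5 in two's complement.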
Example
Denote
Unsigned Multiplication:
Signed Multiplication:
1 Bit Addition
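[Figure: full adder (FA) as a (3 : 2) counter with inputs A, B, Cin and outputs S, Cout, built from two half-adders (HA).]
[Figure: half-adder implementations with inputs x, y and output s: (b) NOR-gate half-adder; (c) NAND-gate half-adder with complemented carry.]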
Source: Parhami
[Figure: full-adder implementations with inputs x, y, c_in and outputs s, c_out: (a) FA built of two HAs; a mux-based FA using two four-input muxes; (c) two-level AND-OR FA.]
Source: Parhami
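The half-adder and full-adder structures can be written out as bit-level functions. This sketch assumes single-bit integer inputs (0 or 1); the function names are illustrative only.

```python
def half_adder(x: int, y: int) -> tuple[int, int]:
    """Return (sum, carry) of two bits."""
    return x ^ y, x & y


def full_adder(x: int, y: int, c_in: int) -> tuple[int, int]:
    """FA built of two HAs plus an OR gate; returns (s, c_out)."""
    s1, c1 = half_adder(x, y)
    s, c2 = half_adder(s1, c_in)
    return s, c1 | c2


def full_adder_and_or(x: int, y: int, c_in: int) -> tuple[int, int]:
    """Two-level AND-OR FA; returns (s, c_out)."""
    s = (x & ~y & ~c_in | ~x & y & ~c_in | ~x & ~y & c_in | x & y & c_in) & 1
    c_out = (x & y) | (x & c_in) | (y & c_in)
    return s, c_out
```

Both versions agree on all eight input combinations, and in each case `s + 2 * c_out` equals the number of ones among the inputs, which is why the full adder is also called a (3 : 2) counter.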
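n-bit Ripple Carry Adder
Area, delay?
[Figure: ripple carry adder performing n-bit parallel addition. Full adders (FA) for bits 0, 1, 2, ..., n-1 take inputs A0 B0, A1 B1, A2 B2, ... and produce S0, S1, S2, ..., Sn-1; the carry ripples from Cin at bit 0 to Cout. Block symbol: n-bit two-operand adder with Cin and Cout.]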
Carry Network
[Figure: carry network producing carries C0 through C3 from the generate/propagate pairs G0 P0, G1 P1, G2 P2, G3 P3.]
Faster Addition
We need to break the carry chain
The carry recurrence: $c_{i+1} = g_i + p_i c_i$
Observation: Carry only propagates in certain situations
[Figure: carry chain evaluating the recurrence from c0 through c1, c2, ..., c(k-2), c(k-1) to ck, using g0, p0, g1, p1, ..., g(k-2), p(k-2), g(k-1), p(k-1).]
Bit positions
       15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
        1  0  1  1  0  1  1  0  0  1  1  0  1  1  1  0
cout    0  1  0  1  1  0  0  1  1  1  0  0  0  0  1  1   cin
Carry chains and their lengths (left to right): 4, 6, 3, 2
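The carry recurrence can be evaluated bit by bit on the 16-bit example. This is a sketch under the usual definitions g_i = x_i AND y_i and p_i = x_i XOR y_i; the helper names `carries` and `add_with_carries` are made up for illustration.

```python
def carries(x: int, y: int, c_in: int, k: int) -> list[int]:
    """Return [c_0, ..., c_k] from c_{i+1} = g_i + p_i c_i."""
    c = [c_in]
    for i in range(k):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        g, p = xi & yi, xi ^ yi
        c.append(g | (p & c[i]))
    return c


def add_with_carries(x: int, y: int, c_in: int, k: int) -> tuple[int, int]:
    """Form the sum bits s_i = x_i XOR y_i XOR c_i; return (s, c_out)."""
    c = carries(x, y, c_in, k)
    s = 0
    for i in range(k):
        s |= (((x >> i) ^ (y >> i) ^ c[i]) & 1) << i
    return s, c[k]
```

Calling `add_with_carries(0b1011011001101110, 0b0101100111000011, 0, 16)` reproduces ordinary integer addition of the two operands, with the 17th bit returned as `c_out`.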
Manchester Adder
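[Figure: Manchester adder. Each bit position i computes Kill, Generate, Propagate (KGP) signals Ki, Gi, Pi from Ai and Bi (A0 B0, A1 B1, ..., An-1 Bn-1). A Switched Carry Chain (SCC) cell per bit passes Ci to Ci+1, so the carry runs from Cin through C1, C2, ..., Cn-1 to Cout.]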
Carry Look Ahead
G = A and B
P = A xor B

A  B  C-out
0  0  0      kill
0  1  C-in   propagate
1  0  C-in   propagate
1  1  1      generate

[Figure: bit slices 0 through 3, each taking Ai, Bi and producing Gi, Pi, and the sum bit S.]
C0 = Cin
C1 = G0 + C0 P0
C2 = G1 + G0 P1 + C0 P0 P1
C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
C4 = . . .
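The lookahead equations can be checked exhaustively for a 4-bit adder. The sketch below assumes C4 follows the same pattern as C1 through C3 (one more product term per level); `cla4` is a hypothetical name.

```python
def cla4(a: int, b: int, c0: int) -> tuple[int, int]:
    """4-bit carry-lookahead addition; returns (sum, c4)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]      # G = A and B
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]    # P = A xor B
    # Every carry is a two-level function of g, p, and c0 only.
    c1 = g[0] | c0 & p[0]
    c2 = g[1] | g[0] & p[1] | c0 & p[0] & p[1]
    c3 = g[2] | g[1] & p[2] | g[0] & p[1] & p[2] | c0 & p[0] & p[1] & p[2]
    c4 = (g[3] | g[2] & p[3] | g[1] & p[2] & p[3]
          | g[0] & p[1] & p[2] & p[3] | c0 & p[0] & p[1] & p[2] & p[3])
    c = [c0, c1, c2, c3]
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c4
```

Because no carry depends on another carry, all four are available after the same gate delay, which is what breaks the ripple chain.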
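[Figure: gate-level lookahead logic producing c1, c2, and c4 from g0 to g3, p0 to p3, and c0.]
Two-bit group signals: $P^1_0 = P^0_0 P^0_1$, $G^1_0 = G^0_1 + G^0_0 P^0_1$
c1 = g0 + c0 p0
c2 = g1 + g0 p1 + c0 p0 p1
c3 = g2 + g1 p2 + g0 p1 p2 + c0 p0 p1 p2
c4 = g3 + g2 p3 + g1 p2 p3 + g0 p1 p2 p3 + c0 p0 p1 p2 p3
[Figure: four-bit carry network implementing these equations, with inputs c0, g0, p0, ..., g3, p3 and outputs c1, c2, c3, c4.]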
Source: Parhami
[Figure: 16-bit adder built from four 4-bit adders. Each 4-bit adder produces a group G and P from its bit-level g0 to g3 and p0 to p3; a second-level lookahead unit takes C0 and the group signals G0 P0 through G3 P3 and produces the block carries up to C16.]
C1 = G0 + C0 P0
C2 = G1 + G0 P1 + C0 P0 P1
C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
C4 = . . .
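[Figure: 8-bit adder built hierarchically from 2-bit CLA blocks and 2-bit carry-lookahead generators (CLG). Each 2-bit CLA takes two A B pairs and outputs level-1 signals G1, P1; pairs of these feed a 2-bit CLG that outputs level-2 signals G2, P2, for example $G^2_0 = G^1_0 P^1_1 + G^1_1$; a top-level 2-bit CLG combines the level-2 signals with C0 to produce C8.]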
Carry-select adder
[Figure: carry-select adder assembled from several n-bit adder blocks, with Cout at the output.]
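[Figure: chain of full adders (FA) with inputs A0 B0, A1 B1, A2 B2, ... and flip-flop (FF) stages on the sum outputs S0, S1, S2, ..., Sn-1 (n-1, n-2, n-3, ... FFs per output); carry runs from Cin to Cout.]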
Multiplication
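[Figure: 4-bit multiplication of multiplicand a by multiplier x = x3 x2 x1 x0. The partial products x0 a 2^0, x1 a 2^1, x2 a 2^2, x3 a 2^3 are summed to form the product p; the diagram is annotated with p(0) through p(6) and s.]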
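The sum of shifted partial products can be sketched directly. This assumes unsigned operands and a k-bit multiplier; `shift_add_multiply` is an illustrative name, not something defined on the slides.

```python
def shift_add_multiply(a: int, x: int, k: int = 4) -> int:
    """Multiply a by the k-bit unsigned multiplier x using shifts and adds."""
    p = 0
    for j in range(k):
        if (x >> j) & 1:
            p += a << j      # partial product x_j * a * 2**j
    return p
```

Each loop iteration corresponds to one row of the partial-product array; a hardware multiplier either steps through the rows sequentially or adds them all at once with a multi-operand adder.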
Terminology
Serial Implementation
[Figure: serial accumulation in which operand Oi[n] is added to the running sum Si[n + log i].]
Parallel Implementation
[Figure: tree of CPAs of depth O(log m) adding operands O1[n], O2[n], ..., O8[n], ..., Om-2[n], Om-1[n], Om[n] and producing S[n + log m]; its delay is labeled Ttree-fast-multi-add.]
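n-bit Carry Save Adder
[Figure: n-bit carry-save adder taking O3[n], O2[n], O1[n] and producing C[n], S[n]. It uses one FA per bit position (bit 1: O3[1], O2[1], O1[1] give C[1], S[1]; bit 2: O3[2], O2[2], O1[2] give C[2], S[2]; ...) with no carry connection between the FAs.]
[Figure: a row of full adders wired as a carry-propagate adder (cin to cout) compared with the same row used as a carry-save adder. Legend: full-adder, half-adder.]
Carry-save adder (CSA), or (3; 2)-counter, or 3-to-2 reduction circuit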
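A carry-save adder is a row of independent full adders, which maps onto bitwise operations. The sketch below assumes non-negative Python integers as operands; `carry_save_add` and `multi_add` are hypothetical names, and the reduction order in `multi_add` is one arbitrary choice, not a specific tree from the slides.

```python
def carry_save_add(o1: int, o2: int, o3: int) -> tuple[int, int]:
    """Reduce three operands to a (sum, carry) pair with no carry propagation."""
    s = o1 ^ o2 ^ o3                                # per-bit sum
    c = ((o1 & o2) | (o1 & o3) | (o2 & o3)) << 1    # per-bit carry, weight 2
    return s, c


def multi_add(operands: list[int]) -> int:
    """Reduce operands 3-to-2 until two remain, then do one carry-propagate add."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_add(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    return sum(ops)      # final carry-propagate addition
```

The invariant is `s + c == o1 + o2 + o3`; only the last step pays for a full carry chain.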
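[Figure: serial CSA accumulation. Operand Oi[n] and the stored pair Si[n + log i], Ci enter a CSA that produces Si+1[n + log (i+1)] and Ci+1[n + log (i+1)], which are written back to the S and C registers.]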
Tserial-csa-multi-add = O(m)
In the end there are two operands (C, S)
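[Figure: final carry-propagate stage with a half-adder (HA) at bit 1 and further adder cells at bit 2, ..., bit i-1 and above, producing T[1], T[2], T[3], ..., T[i+1], ..., T[n+1], T[n+2].]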
[Figure: CSA tree adding six n-bit operands O1[n] through O6[n]. The first-level CSAs (on O1, O2, O3 and on O4, O5, O6) produce sum and carry vectors such as S1[n:1] and C1[n+1:2]; the next level combines them into S2[n+1:1] and C2[n+2:3]; the last CSA produces S3[n+2:1] and C3[n+2:3].]
[Figure: operands entering a three-level CSA tree whose S and C outputs feed a CLA; intermediate word widths grow from (M+1) through (M+5).]
Tree height = log_1.5(N/2)
M = bitwidth of operands
3 = height of CSA tree
Delay = 3 + log2(M + 3)
Delay = (M+5) + 4
Delay thru CSA network = 3 + log_1.5(M + 3)
Representing a seven-operand addition in tabular form.
Bit position    8  7  6  5  4  3  2  1  0
                         7  7  7  7  7  7    6 x 2 = 12 FAs
                      2  5  5  5  5  5  3    6 FAs
                      3  4  4  4  4  4  1    6 FAs
                   1  2  3  3  3  3  2  1    4 FAs + 1 HA
                   2  2  2  2  2  1  2  1    7-bit adder
                --Carry-propagate adder--
                1  1  1  1  1  1  1  1  1
Total cost = 7-bit adder + 28 FAs + 1 HA
Wallace tree: Reduce the number of operands at the earliest possible opportunity
[Figure: seven-operand Wallace tree in dot notation; successive levels use 12 FAs, 6 FAs, 6 FAs, and 4 FAs + 1 HA, followed by a 7-bit adder.]
Total cost = 7-bit adder + 28 FAs + 1 HA
Dadda tree: Postpone the reduction to the extent possible without causing added delay
h      2   3   4   5    6
n(h)   4   6   9   13   19
[Figure: seven-operand Dadda tree in dot notation; successive levels use 6 FAs, 11 FAs, 7 FAs, and 4 FAs + 1 HA, followed by a 7-bit adder.]
Total cost = 7-bit adder + 28 FAs + 1 HA
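The n(h) values above follow the standard recurrence n(h) = floor(1.5 n(h-1)) with n(0) = 2, which gives the largest number of operands that h levels of (3; 2) counters can reduce to two. A short sketch, with hypothetical helper names:

```python
def dadda_n(h: int) -> int:
    """Maximum number of operands reducible to 2 with h CSA levels."""
    n = 2
    for _ in range(h):
        n = n * 3 // 2      # floor(1.5 * n)
    return n


def min_levels(m: int) -> int:
    """Smallest h with n(h) >= m, i.e. the tree height needed for m operands."""
    h = 0
    while dadda_n(h) < m:
        h += 1
    return h
```

For the seven-operand example, `min_levels(7)` is 4, matching the four reduction levels in both the Wallace and the Dadda tree.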
(5, 5; 4)-counter
Unequal
columns
Source: Parhami
Compressors
[Figure: [4:2] compressor built from two FAs. Bit slice i takes O4[i], O3[i], O2[i], O1[i], and Cin[i] and produces C[i], S[i], and Cout[i]; bit slice i-1 is identical, with inputs O4[i-1] through O1[i-1] and Cin[i-1] and outputs C[i-1], S[i-1], and Cout[i-1].]
[4 : 2] Compressor Adder
[Figure: n-bit [4:2] adder reducing O4[n], O3[n], O2[n], O1[n] to C[n], S[n]. It uses one [4:2] compressor per bit position: bit 1 takes O1[1] through O4[1] and gives C[1], S[1]; bit 2 takes O1[2] through O4[2] and gives C[2], S[2]; and so on up to bit n.]
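A [4:2] compressor made of two full adders can be verified on single bits. This sketch assumes one common wiring, in which the first FA takes O1, O2, O3 and the second takes its sum together with O4 and Cin; the exact wiring in the slide figure is not recoverable, and the function names are illustrative.

```python
def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (sum, carry) of three bits."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def compressor_4_2(o1: int, o2: int, o3: int, o4: int, c_in: int) -> tuple[int, int, int]:
    """Return (s, c, c_out) with o1+o2+o3+o4+c_in == s + 2*(c + c_out)."""
    s1, c_out = full_adder(o1, o2, o3)   # c_out does not depend on c_in
    s, c = full_adder(s1, o4, c_in)
    return s, c, c_out
```

Since `c_out` is independent of `c_in`, chaining slices through Cout/Cin does not create a ripple path across the word.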
[Figure: [5:2] compressor built from three FAs. Bit slice i produces C[i], S[i]; bit slice i-1 takes O5[i-1], O4[i-1], O3[i-1], O2[i-1], O1[i-1] and produces C[i-1], S[i-1].]
Linear System Optimizations
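[Figure: FIR filter with L taps, k = 0, 1, ..., L-1. The input X[n] is multiplied by the coefficients hL-1, hL-2, hL-3, ..., h1, h0, and the products are combined through a chain of adders and z^-1 delays to give y[n].]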
Disadvantages
Large area on the FPGA because of the multipliers, even though the full flexibility of general-purpose multipliers is not required
Limited number of embedded resources such as MAC engines and multipliers in FPGAs
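$A_k$ = constant coefficients, $X_k$ = input data
$$Y = \sum_{k=1}^{K} A_k X_k$$
We can write each $X_k$ as a $B$-bit two's-complement fraction with sign bit $X_{k0}$:
$$X_k = -X_{k0} + \sum_{b=1}^{B-1} X_{kb}\, 2^{-b}$$
Substituting:
$$Y = \sum_{k=1}^{K} A_k \left[ -X_{k0} + \sum_{b=1}^{B-1} X_{kb}\, 2^{-b} \right]$$
$$Y = \sum_{b=1}^{B-1} \left[ \sum_{k=1}^{K} A_k X_{kb} \right] 2^{-b} + \sum_{k=1}^{K} A_k (-X_{k0})$$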
From
previous slide:
How do we compute the bracketed term?
Multiply
Questions: Assume
What
What
Distributed Arithmetic
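Looking at the expanded form:
$$Y = \sum_{k=1}^{K} A_k \left[ -X_{k0} + \sum_{b=1}^{B-1} X_{kb}\, 2^{-b} \right]$$
$$Y = A_1 \left( -X_{10} + X_{11} 2^{-1} + X_{12} 2^{-2} + \cdots + X_{1(B-1)} 2^{-(B-1)} \right) + \cdots + A_K \left( -X_{K0} + X_{K1} 2^{-1} + X_{K2} 2^{-2} + \cdots + X_{K(B-1)} 2^{-(B-1)} \right)$$
$$Y = \sum_{b=1}^{B-1} \left[ \sum_{k=1}^{K} A_k X_{kb} \right] 2^{-b} + \sum_{k=1}^{K} A_k (-X_{k0})$$
$$Y = [A_1 X_{11} + A_2 X_{21} + A_3 X_{31} + \cdots + A_K X_{K1}]\, 2^{-1} + [A_1 X_{12} + A_2 X_{22} + A_3 X_{32} + \cdots + A_K X_{K2}]\, 2^{-2} + \cdots + [A_1 X_{1(B-1)} + A_2 X_{2(B-1)} + A_3 X_{3(B-1)} + \cdots + A_K X_{K(B-1)}]\, 2^{-(B-1)} + A_1 (-X_{10}) + A_2 (-X_{20}) + A_3 (-X_{30}) + \cdots + A_K (-X_{K0})$$
[Figure: distributed arithmetic datapath. The bits X1b, X2b, ..., XKb of the current bit position address a LUT holding all sums of coefficients (address 11...11 holds A1 + A2 + ... + AK); the LUT output goes to a scaling accumulator that adds or subtracts (+/-) and shifts (>>).]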
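The bit-serial LUT scheme can be modeled in software. This is a sketch under stated assumptions: inputs are B-bit two's-complement integers rather than fractions (so bit b has weight 2^b and the sign bit has weight -2^(B-1)), and bits are processed LSB first; `build_lut` and `da_dot` are hypothetical names.

```python
def build_lut(coeffs: list[int]) -> list[int]:
    """Entry addr holds the sum of coeffs[k] for every k whose address bit is 1."""
    K = len(coeffs)
    return [sum(coeffs[k] for k in range(K) if (addr >> k) & 1)
            for addr in range(1 << K)]


def da_dot(coeffs: list[int], xs: list[int], B: int) -> int:
    """Compute sum(A_k * X_k) with one LUT lookup per bit position."""
    lut = build_lut(coeffs)
    acc = 0
    for b in range(B):
        # Gather bit b of every input into a LUT address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The sign-bit term is subtracted, like the A_k(-X_k0) term above.
        acc += -term if b == B - 1 else term
    return acc
```

For instance, `da_dot([3, -5, 7, 2], [1, -3, 7, -8], 4)` equals the direct dot product, 51, using four lookups and no multiplications. The LUT has 2^K entries, which is why long filters are usually split across several smaller LUTs.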
Advantages
Replaces multiplication with LUT lookups
Coefficients are stored in LUTs
Disadvantages
Performance is limited because the next input sample is processed only after every bit of the current input sample has been processed
Increasing the number of bits to be processed has a significant effect on resource utilization
A larger scaling accumulator is needed for a higher number of bits
Increases critical path delay
Address   Data
0000      0
0001      C0
0010      C1
...       ...
1111      C0+C1+C2+C3
More LUTs
Larger size scaling accumulator
Introduction
Common subexpressions = common digit patterns
Before (4+, 4<<):
F1 = X + X<<1 + X<<2
F2 = X + X<<2 + X<<3
Common subexpression => X + X<<2
After (3+, 3<<):
D1 = X + X<<2
F1 = D1 + X<<1
F2 = D1 + X<<3
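Expanding the definitions shows that F1 = 7X and F2 = 13X, so the shared term D1 = X + X<<2 is the digit pattern 101 common to 0111 and 1101. A small sketch that checks both forms agree (function names are illustrative):

```python
def f1_f2_direct(x: int) -> tuple[int, int]:
    """Unshared form: 4 additions, 4 shifts."""
    f1 = x + (x << 1) + (x << 2)      # 7 * x
    f2 = x + (x << 2) + (x << 3)      # 13 * x
    return f1, f2


def f1_f2_shared(x: int) -> tuple[int, int]:
    """Shared form: D1 is computed once; 3 additions, 3 shifts."""
    d1 = x + (x << 2)
    return d1 + (x << 1), d1 + (x << 3)
```

Both functions return `(7 * x, 13 * x)`; the shared version saves one adder and one shifter in hardware.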
H.264 Integer Transform
[Y0]   [1  1  1  1] [X0]
[Y1] = [2  1 -1 -2] [X1]
[Y2]   [1 -1 -1  1] [X2]
[Y3]   [1 -2  2 -1] [X3]
12+, 4<<
Y0 = X0 + X1 + X2 + X3
Y1 = X0<<1 + X1 - X2 - X3<<1
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1<<1 + X2<<1 - X3
Polynomial Transformation
12+, 4<<
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3
H.264 Example
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3
Select D0 = (X0 + X3)
H.264 Example
Y0 = D0 + X1 + X2
Y1 = X0L + X1 - X2 - X3L
Y2 = D0 - X1 - X2
Y3 = X0 - X1L + X2L - X3
Select D1 = (X1 - X2)
H.264 Example
Y0 = D0 + X1 + X2
Y1 = X0L + D1 - X3L
Y2 = D0 - X1 - X2
Y3 = X0 - D1L - X3
Select D2 = (X1 + X2)
H.264 Example
Y0 = D0 + D2
Y1 = X0L + D1 - X3L
Y2 = D0 - D2
Y3 = X0 - D1L - X3
Select D3 = (X0 - X3)
Final Implementation
Extracting 4 divisors: 8+, 2<<
D0 = X0 + X3
D1 = X1 - X2
D2 = X1 + X2
D3 = X0 - X3
Y0 = D0 + D2
Y1 = D1 + D3L
Y2 = D0 - D2
Y3 = D3 - D1L
Original: 12+, 4<<
Rectangle Covering: 10+, 3<<
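The final implementation can be checked against the original transform matrix. This sketch assumes that the L suffix denotes a left shift by one (multiplication by 2), as the <<1 notation on the earlier slide suggests; the names `H`, `transform_direct`, and `transform_shared` are illustrative.

```python
# H.264 integer transform matrix, row by row.
H = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]


def transform_direct(x: list[int]) -> list[int]:
    """Plain matrix-vector product Y = H X."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]


def transform_shared(x: list[int]) -> list[int]:
    """Version with the four extracted divisors: 8 additions, 2 shifts."""
    x0, x1, x2, x3 = x
    d0, d1, d2, d3 = x0 + x3, x1 - x2, x1 + x2, x0 - x3
    return [d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)]
```

Both functions give the same output for any integer input vector, for example `[10, 9, 0, -1]` for the input `[1, 2, 3, 4]`.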
Unoptimized
Expression Trees
Extracting Common
Expression (A + B)
Optimization
Experimental Results
DA vs. Add and Shift Method

Add and Shift Method
Filter (# taps)   Slices   LUTs     FFs      Performance (Msps)
(missing)         264      213      509      251
10                474      406      916      222
13                386      334      749      252
20                856      705      1650     250
28                1294     1145     2508     227
41                2154     1719     4161     223
61                3264     2591     6303     192
119               6009     4821     11551    203
151               7579     6098     14611    180

DA
Filter (# taps)   Slices   LUTs     FFs      Performance (Msps)
(missing)         524      774      1012     245
10                781      1103     1480     222
13                929      1311     1775     199
20                1191     1631     2288     199
28                1774     2544     3381     199
41                2475     3642     4748     222
61                3528     5335     6812     199
119               6484     9754     12539    205
151               8274     12525    15988    199
Experimental Results
MAC vs. Add and Shift Method

                  Add Shift Method      MAC filter
Filter (# taps)   Slices   Msps         Slices   Msps
(missing)         264      296          219      262
10                475      296          418      253
13                387      296          462      253
20                851      271          790      251
28                1303     305          886      251
41                2178     296          1660     243
61                3284     247          1947     242
119               6025     294          3581     241
151               7623     294          7631     215
D1 = X1 + X2 + X2<<1
Algebraic methods
Greedy
Iterative algorithm
Extracts
S
>> D12 = a
D1+
+
b+
D1 C +
c
d
Terminates
Experimental results
Comparing # of CSAs
Experimental results
FPGA synthesis
Virtex II FPGAs
Synthesized designs and performed place & route
Conclusions
Optimized