Digital Arithmetic: CSE 237D: Spring 2008 Topic #8 Professor Ryan Kastner


Digital Arithmetic

CSE 237D: Spring 2008


Topic #8
Professor Ryan Kastner

Data Representation

Floating point representation
  Large dynamic range and high precision
  Costly in hardware

Fixed point representation
  Requires fewer resources
  Comparable performance
  Bitwidth analysis trades off estimation accuracy against the number of fixed-point bits; 8 bits is sufficient here

Moorea Modem Receiver

Specification: generalized multiple hypothesis test (GMHT). Parallel Matching Pursuit cores feed an arg min over i decision. Note: 112 samples/symbol + 112 samples for channel clearing.

Walsh/m-Sequence Waveforms

Chip rate 5 kcps, approx. 5 kHz bandwidth. Uses a 25 kHz carrier.
Uses a 7-chip m-sequence c per Walsh symbol, 8 bits per Walsh symbol bi. Composite symbol duration is thus T = 11.2 msec (longer than the maximum multipath spread).
Symbol rate is 266 bps, or 133 bps using an 11.2 msec time guard band for channel clearing.

Transmitted Signal

[Figure: Walsh/m-sequence signal parameters; a transmitted waveform of 8 Walsh symbols, each an 11.2 msec train of +1/-1 chips.]

Moorea Modem Receiver

[The same GMHT receiver diagram, now highlighting the Channel Estimation block feeding the Matching Pursuit cores. Note: 112 samples/symbol + 112 samples for channel clearing.]

Matching Pursuits Core

Goal: Map matching pursuits to a reconfigurable device
  Parameterizable number of samples, data representation
  Tradeoffs - provides designs with various area, latency, energy

Matching Pursuits Algorithm

MP(r, S, A, a)
 1  for i = 1, 2, ..., NS        // compute matched filter (MF) outputs
 2      v_i^0 <- S_i^T r
 3      f_i <- 0
 4      g_i <- 0
 5  end for
 6  q^0 <- 0
    // do successive interference cancellation
 7  for j = 1, 2, ..., Nf
 8      v^j <- v^(j-1) - f_q A_q        // update MF outputs
 9      for k = 0, 1, ..., NS - 1
10          g_k <- v_k^j / a_k
11          Q_k <- (v_k^j)* g_k
12      end for
13      q^j <- argmax over k not in {q^1, ..., q^(j-1)} of {Q_k}
14      f_q <- g_q
15  end for
16  return(f)

Reconfigurable System

[System diagram: the algorithm mapped onto CLBs, Block RAM, and IP cores (multipliers) using the design tools; inputs S, A, a and state Q, g, v are sequenced by k, j control.]
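The loop above can be sketched in plain Python. This is a minimal software model of greedy matching pursuit (all names are mine, and the normalized-correlation selection rule is the textbook greedy variant, not necessarily the exact hardware metric):

```python
# Minimal matching pursuit sketch: greedily pick the atom that best
# explains the residual, subtract its contribution, repeat.
# Plain-Python stand-in for the slides' hardware MP core (names are mine).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(r, atoms, iters):
    """Greedy sparse approximation of r over the given atoms."""
    residual = list(r)
    coeffs = [0.0] * len(atoms)
    for _ in range(iters):
        # Pick the atom with the largest normalized correlation.
        best_k = max(range(len(atoms)),
                     key=lambda k: abs(dot(residual, atoms[k])) ** 2
                                   / dot(atoms[k], atoms[k]))
        g = dot(residual, atoms[best_k]) / dot(atoms[best_k], atoms[best_k])
        coeffs[best_k] += g
        residual = [ri - g * ai for ri, ai in zip(residual, atoms[best_k])]
    return coeffs, residual

# With orthogonal atoms, two iterations recover the signal exactly.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
coeffs, residual = matching_pursuit([3.0, -2.0, 0.0], atoms, 2)
```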

In Depth: Data Representation

History of Number Systems

Oldest Number System?

Fingers, but only 10
Toes, but only 20
Base 10, digital
Roman schools taught finger counting and multiplication/division on hands/toes

"Counting in binary is just like counting in decimal if you are all thumbs." ~ Glaser and Way

Sand Tables
  Stones in the sand
  Three grooves with up to ten stones per groove
  "Calculate" is said to be derived from the Latin word "calcis" because limestone was used in the first sand tables.

"Base eight is just like base ten really, if you're missing two fingers." ~ Tom Lehrer

Key Idea: Formal Notation

Notches on bones, 8500 BC, in Africa and Europe
Count in multiples of some basic number: 5 or 10 based on fingers; Mayans used 360; Babylonians 60
Greeks and Romans extended this, but it is fundamentally still the same idea
Positional notation is key: the same symbol in different spots has different meaning

Numbers

Any number system requires:
  A set of digits
  A set of possible values for the digits
  Rules for mapping the digits and values onto a number

Example: Roman Numerals
  Symbols used to represent a value:
  1 = I, 5 = V, 10 = X, 50 = L, 100 = C, 500 = D, 1000 = M
  For example: 2008 = MMVIII

Unsigned Number Systems

Unsigned integer decimal systems
  Set of digits represented by a digit vector X = (X_{n-1}, X_{n-2}, ..., X_1, X_0)
  Set of values for the digits: S_i = {0, 1, 2, ..., 9}
  Rule for determining the number: X = Σ_{i=0}^{n-1} X_i 10^i

Unsigned binary systems
  Set of digits represented by a digit vector X = (X_{n-1}, X_{n-2}, ..., X_1, X_0)
  Set of values for the digits: S_i = {0, 1}
  Rule for determining the number: X = Σ_{i=0}^{n-1} X_i 2^i

Other Useful Encodings

Some 4-bit number representation formats:
  Floating point: exponent in {-2, -1, 0, 1}, significand in {0, 1, 2, 3}
  Logarithmic: the 4 bits hold the base-2 logarithm of the value

Source: Parhami

Encoding Numbers in 4 Bits

[Number line from -16 to +16 comparing the values representable by each 4-bit format:]
  Unsigned integers
  Signed-magnitude
  3 + 1 fixed-point, xxx.x
  Signed fraction, .xxx
  2's-compl. fraction, x.xxx
  2 + 2 floating-point, s * 2^e, with e in [-2, 1] and s in [0, 3]
  2 + 2 logarithmic (log x = xx.xx)

Source: Parhami

Sign and Magnitude Representation

[Circle diagram: bit patterns 0000-1111 around a ring. Patterns 0000-0111 represent the signed values +0 through +7; patterns 1000-1111 represent -0 through -7. Incrementing the pattern steps through the positive values; decrementing steps through the negatives.]

Source: Parhami

Sign and Magnitude Adder

[Block diagram: Sign x and Sign y feed Add/Sub control logic, which drives selective complement blocks on the operands; an adder with cin/cout computes the magnitude; a selective complement on the sum s and logic producing Sign s complete the result.]

Source: Parhami

Biased Representations

[Circle diagram: with a bias of 8, bit pattern 0000 represents -8, 1000 represents 0, and 1111 represents +7. Incrementing the bit pattern always increments the signed value.]

Source: Parhami

Arithmetic with Biased Numbers

Addition/subtraction of biased numbers:
  x + y + bias = (x + bias) + (y + bias) - bias
  x - y + bias = (x + bias) - (y + bias) + bias
A power-of-2 (or 2^a - 1) bias simplifies addition/subtraction.
Comparison of biased numbers: compare like ordinary unsigned numbers; find the true difference by ordinary subtraction.
We seldom perform arbitrary arithmetic on biased numbers. Main application: the exponent field of floating-point numbers.

Source: Parhami

Ones Complement Number Representation

[Circle diagram: unsigned patterns 0000 through 0111 represent +0 to +7; patterns 1000 through 1111 represent -7 to -0.]

Ones complement = digit complement (diminished radix complement) system for r = 2
  M = 2^k - ulp
  (2^k - ulp) - x = x_compl
  Range of representable numbers with k whole bits: from -2^(k-1) + ulp to 2^(k-1) - ulp

Twos Complement Number Representation

[Circle diagram: unsigned patterns 0000 through 0111 represent +0 to +7; patterns 1000 through 1111 represent -8 to -1.]

Twos complement = radix complement system for r = 2
  M = 2^k
  2^k - x = [(2^k - ulp) - x] + ulp = x_compl + ulp
  Range of representable numbers with k whole bits: from -2^(k-1) to 2^(k-1) - ulp

Source: Parhami

Twos Complement Adder/Subtractor

[Block diagram: operand y passes through controlled complementation, a 2-to-1 mux selecting y or y-bar (the mux can be replaced with k XOR gates); the adder computes s = x +/- y, with the add/sub signal also driving cin: 0 for addition, 1 for subtraction.]

Source: Parhami

Sign and Magnitude vs. Twos Complement

[Side-by-side diagrams of the two adder/subtractors.]

A signed-magnitude adder/subtractor is significantly more complex than a simple adder. A twos-complement adder/subtractor needs very little hardware other than a simple adder.

Source: Parhami

Fixed Point Representations

Allows us to use rational numbers: a/b
Numbers represented in the form:
  X = X_{a-1} X_{a-2} ... X_1 X_0 . X_{-1} X_{-2} ... X_{-b}

Unsigned mappings:
  X = Σ_{i=-b}^{a-1} X_i 2^i = (1/2^b) Σ_{i=0}^{n-1} X_i 2^i

Twos complement mapping:
  X = (1/2^b) ( -2^(n-1) X_{n-1} + Σ_{i=0}^{n-2} X_i 2^i )

Fixed Point Properties

Resolution: smallest non-zero magnitude
  Directly related to the number of fractional bits (b)
  Unsigned binary fixed point: resolution = 1/2^b

Range: difference between the most positive and most negative number
  Unsigned binary fixed point: range = 2^a - 2^(-b)
  Largely dependent on the number of integer bits

Accuracy: magnitude of the maximum difference between a real value and its representation
  Unsigned binary fixed point: accuracy = 1/2^(b+1)
  Accuracy(x) = resolution(x)/2
  With one fractional bit, the worst-case value is 1/4 away, since 1/4 is equidistant from 0 and 1/2, both of which are representable with 1 fractional bit

Example

Denote unsigned fixed point systems as U(a,b).
Given the fixed point number system U(6,2):
  What number does 8A_16 represent?
  What is the range of U(6,2)?
  What is the resolution?
  What is the accuracy?

Rules of Fixed Point Arithmetic

Unsigned wordlength U(a,b): a + b bits
Signed wordlength S(a,b): a + b + 1 bits
Unsigned range U(a,b): 0 <= x <= 2^a - 2^(-b)
Signed range S(a,b): -2^a <= x <= 2^a - 2^(-b)
Addition: Z(a+1,b) = X(a1,b1) + Y(a2,b2)
  X and Y must be scaled, i.e. a1 = a2 = a and b1 = b2 = b
Unsigned multiplication: U(a1,b1) x U(a2,b2) = U(a1 + a2, b1 + b2)
Signed multiplication: S(a1,b1) x S(a2,b2) = S(a1 + a2 + 1, b1 + b2)
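These rules are easy to check numerically. A small sketch with hypothetical helper names, answering the U(6,2) questions from the example slide:

```python
# Sketch of the U(a,b) unsigned fixed point rules above (helper names mine).
def u_value(bits, b):
    """Interpret an unsigned integer bit pattern as a U(a,b) value."""
    return bits / 2 ** b

def u_range(a, b):
    """Smallest and largest representable U(a,b) values."""
    return (0.0, 2 ** a - 2 ** -b)

def u_resolution(b):
    return 2 ** -b

# U(6,2) uses 6 + 2 = 8 bits. The pattern 0x8A = 138 represents 138 / 4.
value = u_value(0x8A, 2)     # 34.5
lo, hi = u_range(6, 2)       # 0.0 .. 63.75
res = u_resolution(2)        # 0.25
acc = res / 2                # accuracy = resolution / 2 = 0.125
```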

In Depth: Arithmetic Operations

1 Bit Addition

Half Adder (HA): a (2 : 2) counter with inputs A, B and outputs S, Cout.
Full Adder (FA): a (3 : 2) counter with inputs A, B, Ci and outputs S, Cout; can be built from two half adders.

Half Adder Implementations

[Gate-level schematics:]
(a) AND/XOR half-adder
(b) NOR-gate half-adder
(c) NAND-gate half-adder with complemented carry

Source: Parhami

Full Adder Implementations

[Schematics:]
(a) Built of two half-adders
(b) Built as a two-level AND-OR circuit
(c) Mux-based, suitable for CMOS realization

Source: Parhami


Bit Serial Addition

Perform addition one bit at a time: S_i = X_i + Y_i + C_(i-1)
Result stored in a register that is right-shifted
Slow, but small area

Ripple Carry Adder

[An n-bit ripple carry adder: full adders FA_0 ... FA_{n-1}, where FA_i adds A_i, B_i, and the carry from stage i-1; Cin enters stage 0 and Cout leaves stage n-1. Abstractly: an n-bit two-operand adder.]

Bit-parallel adder. Area? Delay?

Another View of the Ripple Carry Adder

[Each bit position i produces generate G_i and propagate P_i from A_i, B_i; a carry network chained from C_0 produces all the carries.]

Faster Addition

We need to break the carry chain.
The carry recurrence: c_{i+1} = g_i + p_i c_i
Observation: a carry only propagates in certain situations. Unrolling the recurrence:
  c_k = g_{k-1} + p_{k-1} g_{k-2} + p_{k-1} p_{k-2} g_{k-3} + ... + p_{k-1} ... p_1 g_0 + p_{k-1} ... p_0 c_0

Example (bit positions 15 ... 0):

        1 0 1 1 0 1 1 0 0 1 1 0 1 1 1 0
  cout  0 1 0 1 1 0 0 1 1 1 0 0 0 0 1 1  cin
        \_______/\___________/\____/\__/
            4          6         3    2

Carry chains and their lengths.

Manchester Adder

[Each bit position computes Kill, Generate, Propagate (KGP) signals K_i, G_i, P_i from A_i, B_i; a switched carry chain (SCC) cell per bit passes, kills, or generates the carry C_{i+1} from C_i, chaining from Cin to Cout.]

Carry Look Ahead

G = A and B (generate)
P = A xor B (propagate)

  A B | carry behavior
  0 0 | kill
  0 1 | propagate
  1 0 | propagate
  1 1 | generate

C0 = Cin
C1 = G0 + C0 P0
C2 = G1 + G0 P1 + C0 P0 P1
C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
C4 = . . .

Plumbing as a Carry Lookahead Analogy

[Figure: a water-pipe analogy in which each g_i is a faucet and each p_i a valve; water reaching c_1, c_2, or c_4 corresponds to a carry generated at some position and propagated through every later one.]

2 Bit Carry Lookahead Adder

[A 2-bit CLA block takes the bit-level signals G0_0, P0_0, G0_1, P0_1 and C_0, produces the carries C_1, C_2, and exports group signals:]
  P1 = P0_0 P0_1
  G1 = G0_0 P0_1 + G0_1

4 Bit Carry Look Ahead

Full carry lookahead is quite practical for a 4-bit adder:
  c1 = g0 + c0 p0
  c2 = g1 + g0 p1 + c0 p0 p1
  c3 = g2 + g1 p2 + g0 p1 p2 + c0 p0 p1 p2
  c4 = g3 + g2 p3 + g1 p2 p3 + g0 p1 p2 p3 + c0 p0 p1 p2 p3

Complexity is reduced by deriving the carry-out indirectly, but this increases the critical path.

Source: Parhami
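The four lookahead equations can be checked directly. A small sketch (helper names mine); the loop computes c_{i+1} = g_i + p_i c_i, which expands to exactly the flattened equations above:

```python
# 4-bit carry lookahead sketch: all carries computed from the
# generate/propagate signals, matching the equations above.
def cla4(a_bits, b_bits, c0):
    """a_bits, b_bits: lists of 4 bits, LSB first. Returns (sum_bits, c4)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate
    c = [c0]
    for i in range(4):
        # c_{i+1} = g_i + p_i c_i (flattened into two-level logic in hardware)
        c.append(g[i] | (p[i] & c[i]))
    s = [p[i] ^ c[i] for i in range(4)]           # sum bits
    return s, c[4]

def to_bits(x):
    """4-bit LSB-first bit list."""
    return [(x >> i) & 1 for i in range(4)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

s, c4 = cla4(to_bits(9), to_bits(6), 1)   # 9 + 6 + 1 = 16: sum 0000, carry out 1
```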

Carry Look Ahead, Multiple Levels

[Figure: a 4-bit lookahead carry generator produces C1 ... C4 (and group G, P) from g0 ... g3, p0 ... p3 and c0; four such blocks plus a second-level generator form a cascaded 16-bit carry lookahead adder producing C16.]

Cascaded Carry Look-ahead (16-bit): Abstraction

[Four 4-bit adders supply group G_i, P_i to a carry lookahead (CLA) unit, which produces the block carries:]
  C1 = G0 + C0 P0
  C2 = G1 + G0 P1 + C0 P0 P1
  C3 = G2 + G1 P2 + G0 P1 P2 + C0 P0 P1 P2
  C4 = . . .

Carry Lookahead Generator Plumbing Analogy

[Figure: the group generate G0 = g3 + g2 p3 + g1 p2 p3 + g0 p1 p2 p3 and group propagate P0 = p0 p1 p2 p3 drawn as the same pipe network.]

4 Bit Hierarchical CLA

[Two 2-bit CLAs produce group signals (G1_0, P1_0) and (G1_1, P1_1), combined by a 2-bit carry lookahead generator (CLG):]
  G2 = G1_0 P1_1 + G1_1
  P2 = P1_0 P1_1
  C4 = C0 P1_0 P1_1 + G1_0 P1_1 + G1_1

8 Bit Hierarchical CLA

[Four 2-bit CLAs feed two 2-bit CLGs, producing second-level group signals (G2_0, P2_0) and (G2_1, P2_1); a final 2-bit CLG combines these to produce C8 from C0.]

Design Trick: Guess (or Precompute)

For a ripple adder, the critical path doubles with width: CP(2n) = 2*CP(n).
Carry-select adder: compute the upper n bits twice (once assuming carry-in 0, once assuming carry-in 1) and select the right result with a mux once the lower half's carry-out is known:
  CP(2n) = CP(n) + CP(mux)
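A software model of the carry-select trick (names mine): the upper half is computed for both possible carry-ins, and a mux picks the right one.

```python
# Carry-select sketch: guess both carry-ins for the upper half, then select.
def ripple_add(a, b, cin, n):
    """Model of an n-bit ripple adder: returns (sum mod 2^n, carry out)."""
    total = a + b + cin
    return total & ((1 << n) - 1), total >> n

def carry_select_add(a, b, cin, n):
    half = n // 2
    mask = (1 << half) - 1
    lo_sum, lo_carry = ripple_add(a & mask, b & mask, cin, half)
    hi0, c0 = ripple_add(a >> half, b >> half, 0, half)  # guess carry-in 0
    hi1, c1 = ripple_add(a >> half, b >> half, 1, half)  # guess carry-in 1
    hi_sum, cout = (hi1, c1) if lo_carry else (hi0, c0)  # the selecting mux
    return (hi_sum << half) | lo_sum, cout

s, cout = carry_select_add(200, 100, 0, 8)   # 300 = 44 mod 256, carry out 1
```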

Pipelined Ripple Carry Adder

[The ripple carry adder with flip-flop (FF) stages inserted between full adders: input bit i is delayed by i FFs and sum bit i by n-1-i FFs, so one addition completes per cycle once the pipeline fills.]

Multiple Operand Addition

Many applications require summation of many operands. What is the best way to compute this?
  Inner product: s = Σ_i a_i x_i
  Multiplication: the product p = a * x is itself the sum of shifted partial products a x_0 2^0, a x_1 2^1, a x_2 2^2, a x_3 2^3, accumulated through partial sums p(0), p(1), ..., p(6)

Terminology: Serial Implementation

[Serial multi-operand addition: each cycle, a two-operand carry propagate adder adds the next operand O_i[n] to the running sum S_i[n + log i], held in Register S.]

T_serial-multi-add = O(m log(n + log m)) = O(m log n + m log log m)

for m operands of n bits each: addition time grows superlinearly with the number of operands m, and only logarithmically with the operand width n.

Parallel Implementation

[A tree of carry propagate adders (CPAs) reduces operands O1[n] ... Om[n] pairwise in O(log m) levels, producing S[n + log m].]

T_tree-fast-multi-add = O(log n + log(n + 1) + ... + log(n + log2(m) - 1))
                      = O(log m log n + log m log log m)

Can we do this faster?

Carry Save Adder (CSA)

[An n-bit carry save adder: one full adder per bit position, with no carry chain between positions. Three operands O1[n], O2[n], O3[n] go in; a sum vector S[n] and carry vector C[n] come out, with FA_i producing S[i] and C[i] from O1[i], O2[i], O3[i].]

Carry Save Adders

[Dot-notation figures: a carry-propagate adder chains full adders through cin/cout; cutting the carry chain yields a carry-save adder (CSA), also called a (3; 2)-counter or 3-to-2 reduction circuit. In dot notation, a full-adder compacts 3 dots into 2 and a half-adder rearranges 2 dots.]

Source: Parhami

Serial CSA Implementation

[Each cycle, operand O_i[n] and the running values C_i[n + log i], S_i[n + log i] feed a carry save adder, whose outputs C_{i+1}, S_{i+1} go to Register C and Register S.]

T_serial-csa-multi-add = O(m)
In the end there are two operands (C, S).
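The (3; 2) reduction is easy to model on whole machine words. A sketch (names mine) that reduces a list of operands with CSAs and finishes with one ordinary carry propagate addition:

```python
# Carry save sketch: a (3;2) step reduces three operands to a sum word and
# a carry word with no carry propagation; one ordinary add at the very end.
def csa(x, y, z):
    """Bitwise full adder applied across whole words."""
    s = x ^ y ^ z                              # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carry, weighted up
    return s, c

def multi_add(operands):
    """Reduce a list of operands with CSAs, then one carry propagate add."""
    s, c = operands[0], 0
    for op in operands[1:]:
        s, c = csa(s, c, op)
    return s + c                               # final 2:1 reduction

total = multi_add([5, 7, 9, 11])               # 32
```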

Final Reduction (2:1)

[The final C and S vectors are merged by a carry propagate adder: bit i adds S[i] and C[i-1]; a half adder suffices at bit 1; the result is T[n+2 : 1].]

[Worked trace for six operands O1 ... O6: CSA(O1, O2, O3) -> S1[n:1], C1[n+1:2]; CSA(S1, C1, O4) -> S2[n+1:1], C2[n+2:3]; CSA(S2, C2, O5) -> S3[n+2:1], C3[n+2:2]; and so on until only a (C, S) pair remains.]

Carry Save Arithmetic

[A CSA tree reduces the operands to a final (C, S) pair, which a carry lookahead adder (CLA) merges.]

  Delay = 3 + log2(M + 3), where 3 is the height of the CSA tree and M is the bitwidth of the operands
  Tree height = log1.5(N/2) for N operands

Carry Save Arithmetic

Using ripple carry adders (RCAs) instead, of widths (M+1), (M+2), (M+3), (M+4), (M+5):

  Delay = (M+5) + 4

versus a delay through the CSA network of 3 + log1.5(M + 3).

Example Reduction by a CSA Tree

Representing a seven-operand addition in tabular form (number of dots per bit position after each level):

  Bit position:  8  7  6  5  4  3  2  1  0
                       7  7  7  7  7  7
                    2  5  5  5  5  5  3
                    3  4  4  4  4  4  1
                 1  2  3  3  3  3  2  1
                 2  2  2  2  2  1  2  1
       -------- carry-propagate adder --------
              1  1  1  1  1  1  1  1  1

Addition of seven 6-bit numbers in dot notation: the first level uses 6 x 2 = 12 FAs, then 6 FAs, then 6 FAs, then 4 FAs + 1 HA, then a 7-bit adder.
Total cost = 7-bit adder + 28 FAs + 1 HA

A full-adder compacts 3 dots into 2 (compression ratio of 1.5). A half-adder rearranges 2 dots (no compression, but still useful).

Source: Parhami

Wallace and Dadda Reduction Trees

Wallace tree: reduce the number of operands at the earliest possible opportunity. For seven 6-bit numbers: 12 FAs, then 6 FAs, then 6 FAs, then 4 FAs + 1 HA, then a 7-bit adder. Total cost = 7-bit adder + 28 FAs + 1 HA.

Dadda tree: postpone the reduction to the extent possible without causing added delay. For the same input: 6 FAs, then 11 FAs, then 7 FAs, then 4 FAs + 1 HA, then a 7-bit adder. Total cost = 7-bit adder + 28 FAs + 1 HA.

Maximum number of operands n(h) reducible by a CSA tree of height h:

  h     2   3   4   5   6
  n(h)  4   6   9  13  19

Source: Parhami
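The n(h) table follows from the recurrence n(h) = floor(3 n(h-1) / 2) with n(1) = 3, since each CSA level turns every 3 operands into 2. A one-liner check (function name mine):

```python
# Maximum number of operands reducible to two by a CSA tree of height h,
# from the recurrence n(h) = floor(3/2 * n(h-1)), n(1) = 3.
def max_operands(h):
    n = 3                    # one CSA level reduces 3 operands to 2
    for _ in range(h - 1):
        n = n * 3 // 2       # adding a level grows capacity by a factor 3/2
    return n

table = {h: max_operands(h) for h in range(2, 7)}
```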

Generalized Parallel Counters

Multicolumn reduction: a (5, 5; 4)-counter takes five dots from each of two adjacent columns and produces a 4-bit result; such counters can reduce five numbers to two. Unequal columns are also possible, e.g. a (2, 3; 3)-counter.

Generalized parallel counter = parallel compressor.

Source: Parhami

Compressors

Compressors allow for carry-ins and carry-outs. A [4:2] compressor at bit i takes O1[i], O2[i], O3[i], O4[i] plus an incoming carry Cin[i] and produces a sum S[i], a carry C[i], and an outgoing carry Cout[i] that becomes Cin[i+1]. Internally it is two chained full adders; Cout[i] does not depend on Cin[i], so there is no ripple across bit positions.

[4 : 2] Compressor Adder

[An n-bit [4:2] adder: one [4:2] compressor per bit position reduces four n-bit operands O1[n] ... O4[n] to C[n] and S[n], with the compressor carries Cout/Cin chained between adjacent bit positions.]

Higher Order Compressors

[A [5:2] compressor at bit i reduces O1[i] ... O5[i] to a sum S[i] and carry C[i] using a chain of full adders, passing intermediate carries on to bit i+1.]

Moorea Modem Receiver

[The GMHT receiver diagram again, now highlighting the linear system optimizations inside the Matching Pursuit cores.]

Linear System Optimization

Linear systems are ubiquitous in signal processing applications, e.g. the 4-point DCT:

  [y0]   [cos(0)      cos(0)       cos(0)       cos(0)     ] [x0]
  [y1] = [cos(pi/8)   cos(3pi/8)   cos(5pi/8)   cos(7pi/8) ] [x1]
  [y2]   [cos(pi/4)   cos(3pi/4)   cos(5pi/4)   cos(7pi/4) ] [x2]
  [y3]   [cos(3pi/8)  cos(9pi/8)   cos(15pi/8)  cos(21pi/8)] [x3]

We have developed many methods for optimization to hardware, software, FPGA [ASAP04, ASPDAC05, DATE06, ICCD06, Journal of VLSI Signal Processing 07].
1-D linear systems like the one above are aka FIR filters.

FIR Filter Implementations: Multiply Accumulate Method

Convolution of the latest L input samples, where L is the number of coefficients h(k) of the filter and x(n) represents the input time series:

  y[n] = Σ_{k=0}^{L-1} h[k] x[n-k]

[Transposed-form filter structure: x[n] feeds multipliers h_{L-1}, h_{L-2}, ..., h_1, h_0, whose products are accumulated through a chain of adders separated by z^-1 delay registers, producing y[n].]

Disadvantages
  Large area on FPGA due to multipliers, even though the full flexibility of general purpose multipliers is not required
  Limited number of embedded resources such as MAC engines, multipliers, etc. in FPGAs
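A direct software model of the multiply accumulate structure (names mine):

```python
# Direct MAC FIR model: y[n] = sum_{k=0}^{L-1} h[k] * x[n-k],
# with one multiply-accumulate per tap per output sample.
def fir(h, x):
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(L):
            if n - k >= 0:              # samples before t=0 are zero
                acc += h[k] * x[n - k]  # one MAC per tap
        y.append(acc)
    return y

y = fir([1, 2, 3], [1, 0, 0, 1])
```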

FIR Filter Implementations: Distributed Arithmetic

Summation of an inner product, where A_k are constant coefficients and X_k are input data:

  Y = Σ_{k=1}^{K} A_k X_k

We can write each input in two's complement as a B-bit fraction:

  X_k = -X_{k0} + Σ_{b=1}^{B-1} X_{kb} 2^{-b}

Substituting this into the above yields:

  Y = Σ_{k=1}^{K} A_k ( -X_{k0} + Σ_{b=1}^{B-1} X_{kb} 2^{-b} )

Exchanging the order of the summations:

  Y = Σ_{b=1}^{B-1} ( Σ_{k=1}^{K} A_k X_{kb} ) 2^{-b} + Σ_{k=1}^{K} A_k (-X_{k0})

FIR Filter Implementations: Distributed Arithmetic

From the previous slide:

  Y = Σ_{b=1}^{B-1} ( Σ_{k=1}^{K} A_k X_{kb} ) 2^{-b} + Σ_{k=1}^{K} A_k (-X_{k0})

How do we compute the bracketed term? Multiply a particular bit b of each of the inputs by the constants A_1, A_2, ..., A_K.

Questions: assume we are looking at b = 1 (LSB of the inputs), but this generalizes to any b.
  What if bit 1 of every input is 0, i.e. [X_K1 ... X_11] = [00...00]?
  What if X_11 = 1 and the rest are 0, i.e. [00...01]?
  What if X_11 = 1, X_21 = 1, and the rest are 0, i.e. [00...11]?

Distributed Arithmetic

Looking at the summation in a different way: expand each product and regroup by bit position.

  Y = A_1 (-X_{10} + X_{11} 2^-1 + X_{12} 2^-2 + ... + X_{1(B-1)} 2^-(B-1))
    + A_2 (-X_{20} + X_{21} 2^-1 + X_{22} 2^-2 + ... + X_{2(B-1)} 2^-(B-1))
    + ...
    + A_K (-X_{K0} + X_{K1} 2^-1 + X_{K2} 2^-2 + ... + X_{K(B-1)} 2^-(B-1))

  Y = [A_1 X_{11} + A_2 X_{21} + A_3 X_{31} + ... + A_K X_{K1}] 2^-1
    + [A_1 X_{12} + A_2 X_{22} + A_3 X_{32} + ... + A_K X_{K2}] 2^-2
    + ...
    + [A_1 X_{1(B-1)} + A_2 X_{2(B-1)} + A_3 X_{3(B-1)} + ... + A_K X_{K(B-1)}] 2^-(B-1)
    + A_1 (-X_{10}) + A_2 (-X_{20}) + A_3 (-X_{30}) + ... + A_K (-X_{K0})

FIR Filter Implementations: Distributed Arithmetic

The bit slice [X_{Kb} ... X_{2b} X_{1b}] addresses a 2^K-entry LUT holding the precomputed coefficient sums:

  Address    Value
  00...00    0
  00...01    A_1
  00...10    A_2
  00...11    A_1 + A_2
  ...
  11...11    A_1 + A_2 + ... + A_K

The LUT output feeds a scaling accumulator (add, then shift right each bit cycle). The precision of the LUT entries is usually equal to the precision of the input data, B.
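The LUT idea can be sketched for unsigned inputs (the slides use two's complement; the sign term is omitted here to keep the LUT mechanics clear; names mine):

```python
# Distributed arithmetic sketch for unsigned inputs: one LUT lookup per
# input bit position, accumulated with the appropriate binary weight.
def build_lut(A):
    """2^K-entry LUT: entry `addr` holds the sum of A[k] over set bits k."""
    K = len(A)
    return [sum(A[k] for k in range(K) if addr >> k & 1)
            for addr in range(1 << K)]

def da_inner_product(A, X, B):
    """Y = sum_k A[k]*X[k] via bit-serial LUT accumulation, with X[k] < 2^B."""
    lut = build_lut(A)
    acc = 0
    for b in range(B):                                    # one cycle per bit
        addr = sum(((X[k] >> b) & 1) << k for k in range(len(A)))
        acc += lut[addr] << b                             # scaling accumulator
    return acc

y = da_inner_product([3, 5, 7], [2, 4, 6], B=3)   # 3*2 + 5*4 + 7*6 = 68
```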

FIR Filter Implementations: Distributed Arithmetic

Advantages
  Replaces multiplication with LUT lookups
  Coefficients stored in LUTs

Disadvantages
  Performance limited: the next input sample is processed only after every bit of the current input sample has been processed
  Increasing the number of bits to be processed has a significant effect on resource utilization
  A larger scaling accumulator is needed for a higher number of bits, which increases the critical path delay

Example LUT for four coefficients:

  Address   Data
  0000      0
  0001      C0
  0010      C1
  ...
  1111      C0+C1+C2+C3

FIR Filter Implementations: Distributed Arithmetic

Performance can be improved by replication: process multiple bits at a time. This has a significant effect on resource utilization: more LUTs and a larger scaling accumulator.
FIR Filter Implementations: Add and Shift Method

[The same transposed-form filter structure, but each constant coefficient multiplication h_k is realized with shifts and adds rather than a general multiplier.]

Idea: Constant Multiplication to Shift/Add

Multiplication is expensive in hardware. Decompose constant multiplications into shifts and additions:

  13*X = (1101)_2 * X = X + X<<2 + X<<3

Signed digits can reduce the number of additions/subtractions. Canonical Signed Digits (CSD) (Knuth 74):

  (57)_10 = (0111001)_2 = (100-1001)_CSD

Further reduction is possible by common subexpression elimination: up to 50% reduction (R. Hartley, TCS 96).

Introduction

Common subexpressions = common digit patterns (here the pattern 0101, i.e. X + X<<2):

  F1 = 7*X  = (0111)*X = X + X<<1 + X<<2
  F2 = 13*X = (1101)*X = X + X<<2 + X<<3      4 additions, 4 shifts

  D1 = X + X<<2
  F1 = D1 + X<<1
  F2 = D1 + X<<3                              3 additions, 3 shifts

Good for a single variable: FIR filters (transposed form). What about multiple variables (DFT, DCT, etc.)?
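The sharing above is easy to verify (function name mine):

```python
# Verify the shared subexpression: D1 = X + X<<2 lets 7*X and 13*X
# share one adder (3 additions total instead of 4).
def shift_add_7x_13x(X):
    D1 = X + (X << 2)         # the common digit pattern 0101
    F1 = D1 + (X << 1)        # 7*X
    F2 = D1 + (X << 3)        # 13*X
    return F1, F2

F1, F2 = shift_add_7x_13x(5)  # (35, 65)
```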

Linear Systems and Polynomial Transformation

The H.264 integer transform:

  [Y0]   [1  1  1  1] [X0]
  [Y1] = [2  1 -1 -2] [X1]
  [Y2]   [1 -1 -1  1] [X2]
  [Y3]   [1 -2  2 -1] [X3]

Decomposing the constant multiplications (12 additions, 4 shifts):

  Y0 = X0 + X1 + X2 + X3
  Y1 = X0<<1 + X1 - X2 - X3<<1
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1<<1 + X2<<1 - X3

Linear Systems and Polynomial Transformation

Polynomial transformation of the same H.264 integer transform, with L denoting a left shift by one bit (multiplication by 2):

  Y0 = X0 + X1 + X2 + X3
  Y1 = X0*L + X1 - X2 - X3*L
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1*L + X2*L - X3

H.264 Example

Select the divisor D0 = (X0 + X3) in:

  Y0 = X0 + X1 + X2 + X3
  Y1 = X0*L + X1 - X2 - X3*L
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1*L + X2*L - X3

H.264 Example

After substituting D0, select D1 = (X1 - X2) in:

  Y0 = D0 + X1 + X2
  Y1 = X0*L + X1 - X2 - X3*L
  Y2 = D0 - X1 - X2
  Y3 = X0 - X1*L + X2*L - X3

H.264 Example

After substituting D1, select D2 = (X1 + X2) in:

  Y0 = D0 + X1 + X2
  Y1 = X0*L + D1 - X3*L
  Y2 = D0 - X1 - X2
  Y3 = X0 - D1*L - X3

H.264 Example

After substituting D2, select D3 = (X0 - X3) in:

  Y0 = D0 + D2
  Y1 = X0*L + D1 - X3*L
  Y2 = D0 - D2
  Y3 = X0 - D1*L - X3

Final Implementation

Extracting the 4 divisors gives 8 additions, 2 shifts:

  D0 = X0 + X3
  D1 = X1 - X2
  D2 = X1 + X2
  D3 = X0 - X3

  Y0 = D0 + D2
  Y1 = D1 + D3*L
  Y2 = D0 - D2
  Y3 = D3 - D1*L

Original: 12+, 4<<. Rectangle covering: 10+, 3<<. This method: 8+, 2<<.
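The extracted form can be checked against the original matrix (L realized as << 1; function names mine):

```python
# Check that the divisor-extracted form equals the original H.264
# integer transform (L = left shift by one bit).
def h264_direct(X):
    X0, X1, X2, X3 = X
    return [X0 + X1 + X2 + X3,
            2*X0 + X1 - X2 - 2*X3,
            X0 - X1 - X2 + X3,
            X0 - 2*X1 + 2*X2 - X3]

def h264_shared(X):
    X0, X1, X2, X3 = X
    D0 = X0 + X3                  # 4 divisor additions
    D1 = X1 - X2
    D2 = X1 + X2
    D3 = X0 - X3
    return [D0 + D2,              # 4 output additions, 2 shifts
            D1 + (D3 << 1),
            D0 - D2,
            D3 - (D1 << 1)]

Y = h264_shared([1, 2, 3, 4])
```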

FPGA FIR Filter Implementations: Add and Shift Method

  F1 = A + B + C + D
  F2 = A + B + C + E

[Unoptimized expression trees versus trees after extracting the common expression (A + B + C), or alternatively (A + B): the optimization restructures the adder trees so hardware is shared.]

Resource Utilization + Performance Results

Filter implementation using the add and shift method:

  Filter (#taps)   Slices   LUTs    FFs     Performance (Msps)
  6                264      213     509     251
  10               474      406     916     222
  13               386      334     749     252
  20               856      705     1650    250
  28               1294     1145    2508    227
  41               2154     1719    4161    223
  61               3264     2591    6303    192
  119              6009     4821    11551   203
  151              7579     6098    14611   180

Filter implementation using Xilinx Coregen (PDA):

  Filter (#taps)   Slices   LUTs    FFs     Performance (Msps)
  6                524      774     1012    245
  10               781      1103    1480    222
  13               929      1311    1775    199
  20               1191     1631    2288    199
  28               1774     2544    3381    199
  41               2475     3642    4748    222
  61               3528     5335    6812    199
  119              6484     9754    12539   205
  151              8274     12525   15988   199

Experimental Results: DA vs. Add and Shift Method

[Charts comparing resource utilization and performance of the DA (PDA) and add and shift implementations.]

Experimental Results: MAC vs. Add and Shift Method

  Filter (#taps)   Add Shift Method      MAC filter
                   Slices    Msps        Slices    Msps
  6                264       296         219       262
  10               475       296         418       253
  13               387       296         462       253
  20               851       271         790       251
  28               1303      305         886       251
  41               2178      296         1660      243
  61               3284      247         1947      242
  119              6025      294         3581      241
  151              7623      294         7631      215


CSA CSE for Linear Systems

Common subexpression elimination with carry save adders. Example:

  Y1 = X1 + X1<<2 + X2 + X2<<1 + X2<<2
  Y2 = X1<<2 + X2<<2 + X2<<3

Extract the 3-term divisor D1 = X1 + X2 + X2<<1, computed by a CSA as a sum/carry pair (D1S, D1C):

  Y1 = D1S + D1C + X1<<2 + X2<<2
  Y2 = (D1S + D1C)<<2

Algebraic methods: a greedy, iterative algorithm extracts the best 3-term divisor and rewrites the expressions containing it, e.g.

  F1 = a + b + c + d + e
  F2 = a + b + c + d + f
  D1 = a + b + c   =>   F1 = D1S + D1C + d + e
                        F2 = D1S + D1C + d + f

It terminates when there are no more common subexpressions.

Experimental Results

Comparing the number of CSAs: average 38.4% reduction.

FPGA synthesis (Virtex II FPGAs; synthesized the designs and performed place & route):
  Avg 14.1% reduction in # slices and avg 12.9% reduction in # LUTs
  Avg 5.7% increase in delay

Conclusions

Optimized the acoustic modem by focusing on channel estimation and FIR filters. In-depth study of parallelization, number representation, arithmetic, and linear system optimization.

[Closing slide: the GMHT receiver diagram with its Matching Pursuit cores and arg min decision.]