Synthesis For Low Power: A New VLSI Design Paradigm For DSM: Ajit Pal
Ajit Pal
Professor
Department of Computer Science and
Engineering
Indian Institute of Technology Kharagpur
INDIA -721302
Outline
• Why Low Power?
• Sources of power dissipation
– Dynamic Power
– Static Power
• Degrees of Freedom
• Low Power Techniques
– Supply Voltage Scaling Techniques
– Minimizing switching capacitances
– Leakage Power reduction Techniques
Ajit Pal IIT Kharagpur
Why Low-power?
• Until recently, performance has been synonymous with circuit speed or processing power, e.g. MIPS or MFLOPS.
• Implementation involved an area-time tradeoff. Power consumption = k·A·f, where k = 0.063 W/(cm²·MHz), A is the area in cm² and f is the frequency in MHz.
• Power consumption was of secondary concern.
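As a quick illustration of the area-frequency estimate above, the following minimal sketch evaluates P = k·A·f. The function name and the example chip (1 cm² at 100 MHz) are assumptions chosen only for illustration; k is the value quoted on the slide.

```python
K_W_PER_CM2_MHZ = 0.063  # k from the slide, in W/(cm^2 * MHz)


def area_frequency_power(area_cm2: float, freq_mhz: float,
                         k: float = K_W_PER_CM2_MHZ) -> float:
    """First-order power estimate P = k * A * f, in watts."""
    return k * area_cm2 * freq_mhz


if __name__ == "__main__":
    # Hypothetical chip: 1 cm^2 of area clocked at 100 MHz.
    print(f"P = {area_frequency_power(1.0, 100.0):.2f} W")
```

Because the estimate is linear in both A and f, doubling either one doubles the estimated power.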
$P_{dynamic} = \alpha_L \cdot C_L \cdot V_{DD}^2 \cdot f + \sum_i \alpha_i \cdot C_i \cdot V_{DD} \cdot (V_{DD} - V_T)$
[Figure: CMOS gate with pull-up and pull-down networks and load CL. With the pull-up network ON and the pull-down network OFF, a charging current flows into CL; with the pull-up network OFF and the pull-down network ON, a discharging current flows out of CL.]
Switching Power
• During the transition of the output from 0 to Vdd, the energy drawn from the power supply is given by
$E_{0 \to 1} = \int p(t)\,dt = \int V_{dd}\, i(t)\,dt$, where $i(t) = C_L \frac{dV_0}{dt}$
Substituting this we get
$E_{0 \to 1} = V_{dd} \int_0^{V_{dd}} C_L\, dV_0 = C_L V_{dd}^2$
If a square wave of repetition frequency f (= 1/T) is applied at the input, then the power dissipated per unit time is given by
$P_d = \frac{1}{T} \cdot C_L V_{dd}^2 = C_L V_{dd}^2 f$
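The derivation above can be checked numerically. The sketch below approximates the energy integral with a Riemann sum and evaluates the closed-form switching power; the function names and the component values (50 fF, 1.8 V, 100 MHz) are hypothetical and serve only as an illustration.

```python
def switching_energy(c_load: float, vdd: float, steps: int = 1000) -> float:
    """Approximate E = Vdd * integral of C_L dV0 over [0, Vdd] with a Riemann sum."""
    dv = vdd / steps
    return sum(vdd * c_load * dv for _ in range(steps))


def switching_power(c_load: float, vdd: float, freq: float) -> float:
    """Closed form P_d = C_L * Vdd^2 * f."""
    return c_load * vdd ** 2 * freq


if __name__ == "__main__":
    c_load, vdd, freq = 50e-15, 1.8, 100e6  # hypothetical: 50 fF, 1.8 V, 100 MHz
    print(f"E(0->1) = {switching_energy(c_load, vdd):.3e} J")
    print(f"P_d     = {switching_power(c_load, vdd, freq):.3e} W")
```

Since C_L is treated as constant, the integrand does not depend on V0 and the sum reproduces C_L·Vdd² up to rounding.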
Dynamic Power Dissipation
• Short Circuit Power Dissipation
When the input changes slowly, a current ISC flows and power is dissipated even when there is no load or parasitic capacitor. This is known as the short circuit current.
$I_{sc} = \frac{1}{12} \cdot \frac{k \tau f}{V_{DD}} \cdot (V_{DD} - V_T)^3$
Note that the short circuit power dissipation is greatly affected by power supply scaling and is also proportional to the frequency and the rise/fall time of the input signal.
Short Circuit Power Dissipation
$I_{mean} = \frac{4}{T} \left[ \int_{t_1}^{t_2} i(t)\,dt \right]$
For the nMOS transistor operating in the saturation region,
$I_{mean} = \frac{4}{T} \left[ \int_{t_1}^{t_2} \frac{\beta}{2} \left( V_{in}(t) - V_t \right)^2 dt \right]$
This results in
$I_{mean} = \frac{\beta}{12 V_{dd}} (V_{dd} - 2V_t)^3 \cdot \frac{\tau}{T}$
Short circuit power is given by
$P_{sc} = V_{dd} \cdot I_{mean} = \frac{\beta}{12} (V_{dd} - 2V_t)^3 \cdot \tau \cdot f$
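To make the closed-form result concrete, here is a minimal sketch that evaluates the mean short circuit current and the resulting power. The function names and the numeric values are hypothetical. Returning zero when Vdd ≤ 2Vt is an added assumption (the two transistors are then never on together), not something stated on the slide.

```python
def short_circuit_mean_current(beta: float, vdd: float, vt: float,
                               tau: float, period: float) -> float:
    """I_mean = beta / (12 * Vdd) * (Vdd - 2*Vt)^3 * tau / T."""
    if vdd <= 2 * vt:
        return 0.0  # assumption: no overlap conduction below 2*Vt
    return beta / (12 * vdd) * (vdd - 2 * vt) ** 3 * tau / period


def short_circuit_power(beta: float, vdd: float, vt: float,
                        tau: float, freq: float) -> float:
    """P_sc = Vdd * I_mean, with T = 1 / f."""
    return vdd * short_circuit_mean_current(beta, vdd, vt, tau, 1.0 / freq)


if __name__ == "__main__":
    # Hypothetical values: beta = 100 uA/V^2, Vdd = 2 V, Vt = 0.5 V,
    # input rise/fall time tau = 1 ns, f = 100 MHz.
    print(f"P_sc = {short_circuit_power(1e-4, 2.0, 0.5, 1e-9, 1e8):.3e} W")
```

The cubic dependence on (Vdd − 2Vt) is why supply voltage scaling reduces this component so strongly.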
Glitching Power
[Figure: cross-section of a CMOS inverter (n+ and p+ diffusions, n-well, p-type substrate) with Vout = "0", showing drain leakage and reverse leakage current.]
$J_{b-b} = A \frac{E V_{app}}{E_g^{1/2}} \exp\left( -B \frac{E_g^{3/2}}{E} \right)$
where $A = \frac{\sqrt{2m^*}\, q^3}{4 \pi^3 \hbar^2}$ and $B = \frac{4 \sqrt{2m^*}}{3 q \hbar}$
[Figure: active power and standby power; vertical axis Power (W), log scale from 10^1 down to 10^-6.]
• Power optimization approaches at the high level are significant, since research results indicate that higher levels of abstraction have greater potential for power reduction.
[Figure: two panels plotted against Vdd (0 to 5); recoverable axis label: Normalized Delay.]
Device Feature Size Scaling
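[Figure: MOS transistor cross-section (gate, oxide, source, drain) with the scaled quantities numbered in the drawing.]
1. $t_{ox}' = t_{ox} / S$
2. $N_D' = N_D \times S$
3. $L' = L / S$
4. $X_j' = X_j / S$
6. $N_A' = N_A \times S$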
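The scaling relations listed above can be applied mechanically. The sketch below is only an illustration: the `Device` container, its field names and the starting values are hypothetical, and only the five relations recoverable from the slide are implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    t_ox: float  # gate-oxide thickness
    length: float  # channel length L
    x_j: float  # junction depth
    n_d: float  # donor doping concentration
    n_a: float  # acceptor doping concentration


def scale(dev: Device, s: float) -> Device:
    """Divide the dimensions by S and multiply the doping levels by S."""
    return Device(
        t_ox=dev.t_ox / s,
        length=dev.length / s,
        x_j=dev.x_j / s,
        n_d=dev.n_d * s,
        n_a=dev.n_a * s,
    )


if __name__ == "__main__":
    # Hypothetical starting point (nm for lengths, cm^-3 for doping), S = 2.
    print(scale(Device(4.0, 180.0, 100.0, 1e17, 1e16), 2.0))
```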
[Figure: reference datapath: latches A and B supply 16-bit operands to a 16-bit adder.]
[…] registers supplies two operands to an adder. Delay of the critical path of the adder is 10 nsec. Operating […]
[Figure: parallel datapath: 16-bit adders with latched inputs A and B clocked at fref/2 and a MUX on the 16-bit output.]
[…] multiplier has been duplicated twice, but the input registers have been clocked at half the frequency of fref. This helps to […] the critical path delay is not more than 20 nsec.
• The estimated dynamic power is
$P_{par} = 2.2\, C_{ref} \cdot \left( \frac{V_{ref}}{2} \right)^2 \times \frac{f_{ref}}{2}$
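The estimate above can be compared with the reference design using the switching-power form C·V²·f. This is a minimal sketch under that assumption; the function names are hypothetical and the reference values are normalized to 1.

```python
def dynamic_power(cap: float, voltage: float, freq: float) -> float:
    """Switching power C * V^2 * f."""
    return cap * voltage ** 2 * freq


def parallel_power(c_ref: float, v_ref: float, f_ref: float) -> float:
    """P_par from the slide: 2.2 * C_ref at V_ref / 2 and f_ref / 2."""
    return dynamic_power(2.2 * c_ref, v_ref / 2, f_ref / 2)


if __name__ == "__main__":
    p_ref = dynamic_power(1.0, 1.0, 1.0)  # normalized reference design
    ratio = parallel_power(1.0, 1.0, 1.0) / p_ref
    print(f"P_par / P_ref = {ratio:.3f}")
```

With these factors the parallel version dissipates about 27.5% of the reference power despite the larger capacitance.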
[Figure: pipelined datapath: an 8-bit adder stage (inputs a8-15 and b8-15, outputs s8-15 and s0-15) between latches.]
[…] stage. Therefore, the 8-bit adder will operate at a clock frequency of 100 MHz with a reduced power supply voltage of Vref/2.
[Figure: combined parallel and pipelined datapath: 8-bit adders (a0-7, b0-7 and a8-15, b8-15) with latches and a MUX producing s0-7, s8-15 and s0-15; clocks fref and fref/2.]
Both power supply and frequency of operation are reduced to achieve substantial overall reduction in power dissipation.
Estimated power:
$P_{parpipe} = (2.5\, C_{ref}) (0.4\, V_{ref})^2 \left( \frac{f_{ref}}{2} \right)$
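As a worked check of the estimate above, the sketch below expresses the power of a modified design relative to the reference as a product of capacitance, voltage and frequency factors. The function name is hypothetical; the factors 2.5, 0.4 and 0.5 are the ones in the slide's formula.

```python
def relative_power(cap_factor: float, voltage_factor: float,
                   freq_factor: float) -> float:
    """Power relative to C_ref * V_ref^2 * f_ref for scaled C, V and f."""
    return cap_factor * voltage_factor ** 2 * freq_factor


if __name__ == "__main__":
    ratio = relative_power(2.5, 0.4, 0.5)
    print(f"P_parpipe / P_ref = {ratio:.2f}")
    print(f"reduction         = {100 * (1 - ratio):.0f}%")
```

The quadratic voltage term dominates: 2.5 times the capacitance is more than offset by running at 0.4·Vref and half the clock rate.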
Dynamic Voltage Scaling (DVS)
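[Figure: processor voltage (0 to 2.5) versus CPU clock frequency (74 to 221), showing destructive, operational and non-functional regions.]
[Figure: normalized energy, with "No voltage scaling" as the reference.]
[Figure: basic scheme of DVS: a task queue with arrival rates λ1, λ2, …, λn (total λ) and workload w feeds a variable voltage processor with service rate μ(r), supply voltage V(r) and frequency f(r).]
[Figure: DC/DC converter: pulse-width […], switch S, low-pass filter, comparator and reference voltage, with VF, output V0 and load RL.]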
The Code Morphing software mediates between x86 software and the Crusoe processor.
[Figure: two bar charts of bit transitions per instruction executed, with two bars per benchmark; the series labels are not recoverable.]
Benchmark    Chart 1        Chart 2
Chat         2.43 / 1.54    1.32 / 1.2
Browse       2.51 / 1.64    1.32 / 1.4
Boyer        2.76 / 2.09    1.76 / 1.72
Nand         2.42 / 1.57    1.25 / 1.16
Reducer      2.57 / 1.71    1.47 / 1.4
Qsort        2.64 / 1.33    1.32 / 1.25
Fastqueens   2.46 / 1.03    0.85 / 0.91
Bus Inversion Coding
• It is a redundant coding scheme where m = n + 1
• If the i-th data word is Si, then either Si or ~Si is transmitted, depending on which would result in fewer bit transitions
• An extra bit P encodes the polarity of the data word
• The coding technique works better for smaller values of n
– For n = 2, switching activity reduction is 25%
– For n = 32, switching activity reduction is 11%
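The scheme described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the author's code: all n + 1 lines are assumed to start at 0, the polarity line P is counted as one of the m lines when comparing the two choices, and the function names are hypothetical.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")


def bus_invert_encode(words, n):
    """Return a list of (bus_value, P) pairs for a sequence of n-bit words."""
    mask = (1 << n) - 1
    bus, p = 0, 0  # assumption: all n + 1 lines start at 0
    encoded = []
    for word in words:
        word &= mask
        inverted = ~word & mask
        # Transitions on all m = n + 1 lines for each choice of polarity.
        cost_plain = popcount(word ^ bus) + (p != 0)
        cost_inverted = popcount(inverted ^ bus) + (p != 1)
        if cost_inverted < cost_plain:
            bus, p = inverted, 1
        else:
            bus, p = word, 0
        encoded.append((bus, p))
    return encoded


def bus_invert_decode(encoded, n):
    """Recover the data words: invert the bus value whenever P = 1."""
    mask = (1 << n) - 1
    return [(~bus & mask) if p else bus for bus, p in encoded]


def transitions(states):
    """Total bit transitions of a sequence of line states, starting from 0."""
    total, prev = 0, 0
    for state in states:
        total += popcount(state ^ prev)
        prev = state
    return total


if __name__ == "__main__":
    words = [0b0000, 0b1111, 0b0000, 0b1110]
    coded = bus_invert_encode(words, 4)
    print("uncoded transitions:", transitions(words))
    print("coded transitions:  ", transitions((b << 1) | p for b, p in coded))
```

In the usage example the coded count includes the extra P line, so the comparison is between n uncoded lines and n + 1 coded lines.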
[Figure: gate schematic with INPUT, pull-down network and load CL.]
• Advantages
– Robust
– Lower switching activity
– Good input/output decoupling
– No charge sharing problem
– Availability of matured logic synthesis tools and techniques
• Disadvantages
– Larger number of transistors (larger chip area and delay)
– Spurious transitions (glitch) due to finite propagation delays, leading to extra power dissipation and incorrect operation
– Short circuit power dissipation
– Weak output driving capability
– Large number of standard cells requiring substantial engineering effort for technology mapping
[Figure: clocked gate between VDD and a pull-down network with INPUT and load CL, showing the precharge and evaluation phases of the clock φ (φ=0, φ=1).]
• Advantages
– Combines the advantages of low […]
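• Sneak path
• Long chain of pass transistors
[Figure: chain of n pass transistors T1, T2, T3, …, Tn driving load CL, with node voltages labelled Vdd, Vdd-Vt and Vdd-mVt.]
$T = 0.69 \cdot \frac{n(n+1)}{2} \cdot R\, C_L$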
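To show how quickly the delay of a pass-transistor chain grows, the following sketch evaluates the formula above for a few chain lengths. The function name and the values R = 1 kΩ and CL = 10 fF are hypothetical.

```python
def pass_chain_delay(n: int, r: float, c_load: float) -> float:
    """T = 0.69 * R * C_L * n * (n + 1) / 2 for a chain of n pass transistors."""
    if n < 1:
        raise ValueError("the chain needs at least one transistor")
    return 0.69 * r * c_load * n * (n + 1) / 2


if __name__ == "__main__":
    r, c_load = 1e3, 10e-15  # hypothetical: 1 kOhm per transistor, 10 fF load
    for n in (1, 2, 4, 8):
        print(f"n = {n}: T = {pass_chain_delay(n, r, c_load):.2e} s")
```

The n(n + 1)/2 factor makes the delay grow roughly quadratically with chain length.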
Experimental Results
• Static CMOS circuits have been realized using the Berkeley SIS tool (script.rugged to optimize the netlist, and technology mapping with 44-2.genlib and the minimum-area option)
• A large number of benchmark circuits are realized using the three logic styles with C/C++ programming on a Sun system
• Area requirements are approximated by the number of transistors
• Estimation models for calculating delay and switching power dissipation for the circuits with the three different logic styles have been proposed, and their accuracies are verified with Spice and Design Analyzer in Cadence
• MOSFET parameters are taken from a 0.18 µm process technology
Experimental Results
Area in #Transistor, Delay in ns, Power in mW.
           Static CMOS circuits      Dynamic CMOS circuits     PTL circuits
Benchmark  Area    Delay   Power     Area    Delay   Power     Area    Delay   Power
(1)        (2)     (3)     (4)       (5)     (6)     (7)       (8)     (9)     (10)
C432       692     3.32    122       581     1.82    88        546     2.08    91
C499       1880    2.23    367       1506    1.69    248       1428    1.62    167
C880       1412    2.21    293       1249    1.79    166       988     1.18    267
C1355      1880    2.61    400       1603    1.51    280       1203    1.04    379
C1908      1756    2.91    367       1689    1.80    251       1088    1.57    298
C2670      1804    2.94    493       1584    1.74    395       1010    1.54    449
C3540      4214    4.57    409       2815    2.63    314       2782    2.58    294
C5315      7058    3.65    830       5970    2.69    515       5364    1.62    778
C6288      11222   11.84   409       8716    4.64    504       6060    4.69    445
C7552      8214    2.99    1604      7328    1.98    1173      5682    1.66    1328
Average % reduction compared to static CMOS circuits:
                                     -16%    -37%    -25%      -33%    -47%    -17%
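The averages in the last row can be recomputed from the table. The sketch below assumes that each average is the mean of the per-benchmark percentage reductions; the slide does not state how the averages were formed, so this is an assumption, and the names used are hypothetical. Under that assumption the recomputed values agree with the slide to within about one percentage point.

```python
# (area in transistors, delay in ns, power in mW), in the table's row order
STATIC = [(692, 3.32, 122), (1880, 2.23, 367), (1412, 2.21, 293),
          (1880, 2.61, 400), (1756, 2.91, 367), (1804, 2.94, 493),
          (4214, 4.57, 409), (7058, 3.65, 830), (11222, 11.84, 409),
          (8214, 2.99, 1604)]
DYNAMIC = [(581, 1.82, 88), (1506, 1.69, 248), (1249, 1.79, 166),
           (1603, 1.51, 280), (1689, 1.80, 251), (1584, 1.74, 395),
           (2815, 2.63, 314), (5970, 2.69, 515), (8716, 4.64, 504),
           (7328, 1.98, 1173)]
PTL = [(546, 2.08, 91), (1428, 1.62, 167), (988, 1.18, 267),
       (1203, 1.04, 379), (1088, 1.57, 298), (1010, 1.54, 449),
       (2782, 2.58, 294), (5364, 1.62, 778), (6060, 4.69, 445),
       (5682, 1.66, 1328)]


def average_reduction(style, baseline=STATIC):
    """Mean per-benchmark % reduction for (area, delay, power)."""
    count = len(baseline)
    return tuple(
        100.0 * sum(1 - s[i] / b[i] for s, b in zip(style, baseline)) / count
        for i in range(3)
    )


if __name__ == "__main__":
    for name, style in (("Dynamic CMOS", DYNAMIC), ("PTL", PTL)):
        area, delay, power = average_reduction(style)
        print(f"{name}: area -{area:.1f}%, delay -{delay:.1f}%, power -{power:.1f}%")
```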
[Figure: bar chart of area (#Transistor) for static CMOS, dynamic CMOS and PTL across the benchmarks C432 to C7552.]
Comparison of Delay
[Figure: bar chart of delay (ns) for static CMOS, dynamic CMOS and PTL across the benchmarks C432 to C7552.]
[Figure: bar chart of switching energy (fJ) across the benchmarks C432 to C7552; only the PTL legend entry is recoverable.]
[Figure: normalized delay and subthreshold current Isub (log scale) versus threshold voltage Vth [V], from 0.0 to 0.8.]
Threshold Voltage (VT) Scaling
Scale down the threshold voltage for low-voltage, low-power circuits to increase performance
VT ↓ ⇒ Delay ↓, Ileakage ↑
Low-VT: provides high performance
VT ↑ ⇒ Delay ↑, Ileakage ↓
High-VT: reduces subthreshold leakage
0.2VDD ≤ VT ≤ 0.5VDD
Threshold Voltage Scaling
• Fabrication of multiple threshold voltages:
– Multiple channel doping
– Multiple oxide thickness
– Multiple channel length
– Multiple body bias
• Various approaches:
– Variable-threshold-voltage CMOS (VTCMOS) approach
– Multi-threshold-voltage CMOS (MTCMOS) approach
– Dual-Vt assignment approach
[Figure: MTCMOS circuit scheme: a circuit with low-Vth transistors placed between the virtual rails VDDV and GNDV, which are connected through transistors Q1 and Q2 controlled by SL.]
MTCMOS Performance
• Simulation results
[Figure: normalized delay and normalized energy for MTCMOS and conventional CMOS (full H-Vth).]
[Figure: circuit graph with nodes including a, c, d, e and f; nodes in the critical path are assigned low-Vth.]
Algorithm
1. Assume low-VT < high-VT < 0.5VDD
2. Initialize all nodes with high-VT
3. Compute the critical path(s)
4. Using DFS traversal, assign low-VT to a node on the critical path
5. Go to Step 3 until all the nodes on the critical path are assigned low-VT
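The steps above can be sketched as follows. This is a simplified illustration, not the author's implementation: the critical path is found with a longest-path pass in topological order rather than the DFS traversal named in step 4, each node has one delay per threshold choice, and the six-node circuit with its delays is entirely hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(preds, delay):
    """Return (length, path) of the longest-delay path in a DAG.

    preds maps each node to the list of its predecessor nodes.
    """
    arrival, parent = {}, {}
    for node in TopologicalSorter(preds).static_order():
        best = max(preds.get(node, []), key=lambda p: arrival[p], default=None)
        arrival[node] = delay[node] + (arrival[best] if best is not None else 0.0)
        parent[node] = best
    node = max(arrival, key=arrival.get)
    length, path = arrival[node], []
    while node is not None:
        path.append(node)
        node = parent[node]
    return length, path[::-1]


def assign_dual_vt(preds, high_delay, low_delay):
    """Steps 2 to 5: start at high-VT, then lower nodes on the critical path."""
    vt = {node: "high" for node in high_delay}  # step 2
    while True:
        delay = {n: low_delay[n] if vt[n] == "low" else high_delay[n] for n in vt}
        _, path = critical_path(preds, delay)  # step 3
        still_high = [n for n in path if vt[n] == "high"]
        if not still_high:  # step 5: the critical path is all low-VT
            return vt
        vt[still_high[0]] = "low"  # step 4


if __name__ == "__main__":
    # Hypothetical six-node circuit; delays are in arbitrary units.
    preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["b"], "e": ["c"], "f": ["d", "e"]}
    high = {"a": 2.0, "b": 2.0, "c": 2.0, "d": 1.0, "e": 2.0, "f": 2.0}
    low = {n: d / 2 for n, d in high.items()}
    print(assign_dual_vt(preds, high, low))
```

Each pass lowers exactly one node, so the loop runs at most once per node before every node on the current critical path is low-VT.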
Delay-Constrained Dual-VT Assignment
Repeat the assignment with different high-VT values (0.2VDD < high-VT < 0.5VDD) to find the one for which the maximum number of nodes can be kept at high-VT, and hence the minimum leakage power is possible.
[Figure: leakage power (μW) versus VT in volts, from 0.15 to 0.55.]
Experimental Results
Comparison of our results with [Wei+99]
Standby (%): % reduction in standby leakage power. Total (%): % reduction in total power. CPU (s): CPU time in seconds.
             With approach [Wei+99]                          Our approach
Benchmark    #Transistor  Standby (%)  Total (%)  CPU (s)    #Transistor  Standby (%)  Total (%)  CPU (s)
C432         278          59.65        16.52      20         348          87.35        26.17      36
C499         604          51.09        7.18       118        796          64.45        12.68      174
C880         1126         84.87        14.07      55         1208         88.65        19.41      89
C1355        1232         49.36        8.51       198        1346         59.95        13.15      346
C1908        1430         76.21        15.46      225        1684         83.45        21.75      412
C2670        2736         81.24        19.27      269        3092         92.96        24.80      485
C3540        3430         85.60        21.43      301        3698         90.42        32.29      541
C5315        5432         83.12        18.44      342        5516         89.69        31.05      619
C6288        5768         43.38        19.89      564        8950         83.69        45.42      890
C7552        7102         76.41        20.35      387        7786         87.65        22.36      609
Average                   69.01%       16.11%                             82.82%       24.92%
[Figure: bar charts of the number of transistors and of CPU time in sec.]
List of publications
1. Debasis Samanta, Ajit Pal, Synthesis of Low Power High
Performance Dual-VT PTL Circuits, Proc. 17th International
Conference on VLSI Design, 2004, pp.85-90, Mumbai, January
2004
2. D. Samanta, Ajit Pal, Logic Styles for High Performance and Low
Power, Proceedings of the 12th International Workshop on Logic
and Synthesis, 2003 (IWLS-2003), pp. 355-362, May, 2003
3. D. Samanta, M. C. Dharmadeep, and Ajit Pal, Synthesis of High
Performance Low Power PTL Circuits, Proc. ASP-DAC 2003,
Kitakyusyu, Japan, pp. 209-212, January 2003.
4. D. Samanta, and A. Pal, Synthesis of Dual-VT Dynamic CMOS
Circuits, Proc. VLSI Design 2003, New Delhi, India, pp. 121-128,
January 2003.
5. D. Samanta, N. Sinha, and A. Pal, Synthesis of High Performance
Low Power Dynamic CMOS Circuits, Proc. ASP-DAC/VLSI Design
2002, Bangalore, India, pp. 99-104, January 2002.
6. D. Samanta, and A. Pal, Optimal Dual-VT Assignment for Low-
Voltage Energy-Constraint CMOS Circuits, Proc. ASP-DAC/VLSI
Design 2002, Bangalore, India, pp. 193-198, January 2002.
7. N. Tripathi, A. Bhosle, D. Samanta, and Ajit Pal, Optimal
Assignment of High-VT for Synthesizing Dual-VT CMOS Circuits,
Proc. VLSI Design 2001, Bangalore, India, pp. 227-232, January
2001.
Thanks!