
Improve Clock Tree Efficiency For Low Power Clock Tree Design


Zhe Ge*, Juan Fu, Peidong Wang, Lei Wang


Microcontroller NXP, Suzhou 215011, China
* Email: glen.ge@nxp.com

Abstract—Low power design is critical in today’s chip design, and the clock tree accounts for much of the chip power. A metric, “clock tree cost”, is introduced to guide low power clock tree design. Five methods are proposed to reduce “clock tree cost” and improve clock tree efficiency: clock sink depth check, redundant scan mux check, redundant clock gating cell check, CCOPT (Clock Concurrent Optimization) and simple clock tree, and low threshold voltage tree. With these methods, clock tree efficiency is improved and clock tree power is reduced.

Keywords—Low power, clock tree, efficiency

1. Introduction

Low power design is increasingly important in today’s SoC design. Low power means longer battery life, higher reliability and lower package cost, and there is huge demand for it in portable devices and IoT applications.

Much of the chip power is consumed by the clock tree, which always has a high toggle rate. One common method to reduce clock tree power is clock gating: the clock is gated when it is not used or when all clock leaf cells are in hold state. But even then, clock tree power is still high.

Low power clock design has been studied widely for years. Some work proposes new clock gating methods, such as data-driven gating cells to improve clock gating efficiency [1]. Some proposes new clock gating cell designs that reduce the power of the gating cell itself [2]. Some addresses clock gating cell split, merge and distribution algorithms [3] [4]. And some addresses register placement and merge methodology [5].

This paper proposes clock tree efficiency as a consideration in low power clock tree design. The rest of the paper is organized as follows. Section 2 introduces “clock tree cost” to measure clock tree efficiency. Section 3 is the main part and presents five ways to improve clock tree efficiency and power. Section 4 gives a summary.

2. Clock Tree Efficiency

In regular clock tree design, the clock tree metrics are mainly clock skew, clock transition and clock latency. If these metrics are met, the clock tree is considered done. This is true for high performance design, which is timing critical, but it is not sufficient for low power design. In low power design, clock buffer* number and clock buffer gate count must also be considered. The clock tree should be designed carefully with a limited number of clock buffers, while clock skew may sometimes be less critical. (* All statements about “buffers” here apply similarly to inverters.)

Based on this point, a new clock metric, “clock tree cost (CTC)”, is proposed to measure clock tree efficiency. CTC is defined as the average number of clock buffers per 100 leaf pins for one clock. The lower the CTC, the higher the clock tree efficiency.

CTC = (buffer_numbers ÷ leaf_pin_numbers) × 100

For a good and healthy clock tree, CTC is low, which means the clock tree can be built with a small number of clock buffers. Clock power is then low because less switching power is spent on clock buffers. On the other hand, if CTC is high, the cause should be identified: it may be a clock tree specification issue or a shortcoming in the clock structure.

3. Improve Clock Tree Efficiency For Low Power

Five ways are proposed to reduce “clock tree cost” and improve clock tree efficiency for low power. The test chips are ARM core MCUs in 90nm CMOS.

3.1 Sink depth check

In chip design, especially MCU design, many different kinds of peripherals are connected to the bus. This leads to low efficiency in clock tree synthesis. In an ideal clock structure, all clock sink pins have the same depth (Fig 1), which gives very high clock tree efficiency. But in a real chip (Fig 1), differing sink pin depths reduce clock tree efficiency.

Figure 1. Different clock structure

For this reason, a script was written to check the sink depth distribution of each clock before clock tree synthesis. Figure 2 shows one example. In Figure 2, the sink depth of the core clock is concentrated on depths 4-6, which gives good efficiency. But the sink depth of the bus clock is spread from 4 to 12. This makes it difficult for the

978-1-4673-9719-3/16/$31.00 ©2016 IEEE


clock tree to achieve good efficiency, resulting in a high “clock tree cost”.

Usually, the bus clock structure is more complicated than the core clock, as there are many different peripherals. This paper suggests checking deep sinks carefully with the design team to see whether they can be improved in the design structure, for example by avoiding too many muxes in modules. The physical designer should also check for incorrect or poorly styled clock specifications.

Figure 2. Sink depth distribution for core and bus clock

After compressing the deep sinks (depth 9-12), the clock buffer number is reduced from 499 to 442 and “clock tree cost” is reduced from 6.91 to 6.12, an 11.4% improvement.

3.2 Redundant scan mux check

Most flip-flops need to be connected to scan chains for test purposes, so there may be many scan muxes in a design. Chip design is usually hierarchical and module based, so one clock may be muxed with the scan clock multiple times, in different modules and at different hierarchy levels, by different designers. This causes no problem for chip function, but these redundant scan muxes introduce reconvergence points and extra clock buffers, which harms clock tree efficiency and increases chip power.

Figure 3. Redundant scan mux

Figure 3 shows an example of redundant scan muxes. Mux A and B are repeated muxes in different modules. Mux B and C are repeated muxes at different hierarchy levels. This complicates both the functional and the scan clock tree, and many unnecessary buffers are added on path AB and path BC.

To check for this problem, a “redundancy scan mux report” algorithm is proposed.

Algorithm 1: redundancy scan mux report
Input: gate level netlist
Output: redundant scan mux list
for each Mux in design do
    stop ← 0
    Drivers ← Get driving instances for Mux d0 pin
    while (stop is 0)
        Tmp_lists ← empty
        for each Driver of Drivers do
            if (Driver is mux) then
                if (Driver and Mux share similar scan enable net) then
                    Report this mux pair
                end if
            elseif (Driver isn’t hard block) then
                Tmp_lists ← Tmp_lists + Get driving instances for Driver
            end if
        end for
        if (Tmp_lists is empty) then
            stop ← 1
        else
            Drivers ← Tmp_lists
        end if
    end while
end for

This script was run on the test chip and dozens of redundant scan muxes were found. After redesigning to remove these extra scan muxes, clock tree efficiency improves by 12.8% (Table 1).

Table 1. “Redundant mux remove” test result

            sinks   buffers   improve   buffer area   improve
original    7184    648                 7546
improved    7184    565       12.8%     6753          10.5%

3.3 Redundant clock gating check

Besides cascaded redundant clock gating (CG) cells, there are also parallel redundant CG cells in a design, for both design and tool reasons. To save power, these CGs should be merged so that they share the latch and control logic in one CG (Fig 4). This also reduces the total input capacitance of the CG cells, which gives more power benefit when the clock is off.

Figure 4. Redundant clock gating

To address this, a script was created and used to check for redundant CGs on the test chip. Table 2 shows a 9% gate count reduction for the core clock and a 4% reduction for the bus clock after merging redundant CGs.

Table 2. “Redundant CG remove” test result

             core clock area   bus clock area   CG cell number
original     52155.13          51783.98         5122
CG remove    47384.57          49721.52         4551
improve      9.1%              4.0%             11.1%

3.4 CCOPT and simple clock tree

The above three methods are mainly design checks, and the remaining two are based on physical implementation.

CCOPT (Clock Concurrent Optimization) is a popular method for clock tree synthesis. It considers both the clock path and the data path when building the clock tree. Clock skew is not so critical as long as data timing is met. It can save a lot of clock buffers because the clock is not balanced for zero skew. Table 3 shows that clock efficiency is improved with the CCOPT method.

Table 3. CCOPT vs CTS

           buffers   buffer gate count   all clock logic area
CTS        1092      6968                55894
CCOPT      887       4633                41957
improve    18.77%    33.51%              24.93%

Clock tree synthesis can be even more aggressive than CCOPT. For example, many MCU designs have an always-on power domain that includes some timers and control logic. The performance requirement for this domain is low, but its power requirement is demanding. With normal clock tree synthesis, clock efficiency is low because there are too many muxes in the timer modules. So this paper proposes a “simple clock tree”, which only fixes clock transitions for these clock trees, without balancing any clock skew. Table 4 verifies that this is an effective method to improve power for low performance clock domains, without much impact on timing.

Table 4. Simple clock tree vs Normal clock tree

           sinks   normal tree buffers   simple tree buffers   improve
DGO clk    119     38                    26                    32%
LPO clk    56      28                    6                     79%

3.5 Low threshold voltage tree

In a multi-Vth design, it is suggested to build the clock tree with lower Vth cells, which have more drive strength. Though this increases leakage power, it enhances clock tree efficiency and improves dynamic power more. Table 5 shows that it improves both cell power and net power. A low efficiency clock benefits more: the bus clock buffer count improves by 27%, compared with 13% for the core clock.

Table 5. Normal Vt tree vs High Vt tree

            total buffers   buffer gate count   clock net length   core clock buffers   bus clock buffers
hvt tree    1083            7510                809854             213                  464
svt tree    900             6054                783053             186                  339
improve     17%             19%                 3%                 13%                  27%

4. Summary

A low power clock tree should be built with a limited number of buffers/inverters. This paper suggests building a highly efficient clock tree with low “clock tree cost”. Five methods are proposed to improve clock tree efficiency and reduce chip power. All these methods are verified on a test chip.

References
[1] S. Wimer and I. Koren, “Design Flow for Flip-Flop Grouping in Data-Driven Clock Gating”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, p. 771-778 (2014)
[2] A. R. Durgam and K. Choi, “Optimized clock gating cell for low power design in nanoscale CMOS technology”, Quality Electronic Design (ASQED), p. 85-88 (2013)
[3] W. X. Shen, Y. C. Cai, X. L. Hong and J. Hu, “An Effective Gated Clock Tree Design Based on Activity and Register Aware Placement”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, p. 1639-1648 (2010)
[4] S. K. Teng and N. Soin, “Low power clock gates optimization for clock tree distribution”, Quality Electronic Design (ISQED), p. 488-492 (2010)
[5] S. H. Wang, Y. Y. Liang, T. Y. Kuo and W. K. Mak, “Power-Driven Flip-Flop Merging and Relocation”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, p. 180-191 (2012)
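
As a worked illustration of the CTC definition in Section 2, the following minimal Python sketch computes CTC and the relative improvement for the figures in Table 1. The function names are hypothetical, and treating the "sinks" column of Table 1 as the leaf pin count is an assumption made only for this example.

```python
def clock_tree_cost(buffer_count, leaf_pin_count):
    """CTC = (buffer_numbers / leaf_pin_numbers) * 100, as defined in Section 2."""
    if leaf_pin_count <= 0:
        raise ValueError("leaf_pin_count must be positive")
    return buffer_count / leaf_pin_count * 100


def improvement_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Table 1 figures. Assumption: the "sinks" column is used as the leaf pin count.
ctc_original = clock_tree_cost(648, 7184)
ctc_improved = clock_tree_cost(565, 7184)

print(round(ctc_original, 2), round(ctc_improved, 2))              # 9.02 7.86
print(round(improvement_percent(ctc_original, ctc_improved), 1))   # 12.8
```

Because the leaf pin count is unchanged, the relative CTC improvement equals the relative buffer reduction, which matches the 12.8% reported in Table 1.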

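
The sink depth check of Section 3.1 is described only as a script, so the sketch below is one possible way to obtain a per-clock depth distribution, not the authors' implementation. It assumes the clock network is available as a hypothetical fanout dictionary (driver name to list of driven instances), that the network is acyclic without reconvergence, and that depth means the number of hops from the clock root to a sink.

```python
from collections import Counter


def sink_depth_histogram(fanout, root, sinks):
    """Return {depth: number of sinks}, where depth is hops from the clock root.

    Assumes an acyclic network without reconvergence, so each sink is
    reached exactly once.
    """
    histogram = Counter()
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node in sinks:
            histogram[depth] += 1
            continue
        for child in fanout.get(node, ()):
            stack.append((child, depth + 1))
    return dict(sorted(histogram.items()))


# Hypothetical toy clock network: one branch goes through a mux and a
# clock gating cell, the other through a single buffer.
fanout = {
    "clk": ["mux1", "buf1"],
    "mux1": ["cg1"],
    "cg1": ["ff_a", "ff_b"],
    "buf1": ["ff_c"],
}
sinks = {"ff_a", "ff_b", "ff_c"}

histogram = sink_depth_histogram(fanout, "clk", sinks)
print(histogram)  # {2: 1, 3: 2}
```

A wide spread between the smallest and largest depth in such a histogram would flag a clock, like the bus clock in Figure 2, for review with the design team.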
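
Algorithm 1 in Section 3.2 can be sketched in Python on a toy netlist model. This is an illustrative sketch under stated assumptions rather than the script used on the test chip: the `Cell` class, its field names and the example netlist are hypothetical, "similar scan enable net" is simplified to an exact net name match, and a visited set is added so that reconvergent paths are traced only once.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    kind: str                                          # "mux", "buf", "port", "hard_block", ...
    drivers: List[str] = field(default_factory=list)   # for a mux: drivers of its d0 pin
    scan_enable: Optional[str] = None                  # select net of a mux


def redundant_scan_mux_pairs(netlist):
    """Report (upstream_mux, mux) pairs that share the same scan enable net."""
    pairs = []
    for name, cell in netlist.items():
        if cell.kind != "mux" or cell.scan_enable is None:
            continue
        drivers = list(cell.drivers)
        visited = set()
        while drivers:                      # stop when no drivers are left to trace
            next_drivers = []
            for drv_name in drivers:
                if drv_name in visited:
                    continue
                visited.add(drv_name)
                drv = netlist[drv_name]
                if drv.kind == "mux":
                    # As in Algorithm 1, tracing stops at the first mux on a path.
                    if drv.scan_enable == cell.scan_enable:
                        pairs.append((drv_name, name))
                elif drv.kind != "hard_block":
                    next_drivers.extend(drv.drivers)
            drivers = next_drivers
    return pairs


# Hypothetical netlist loosely following Figure 3: three scan muxes in series.
netlist = {
    "clk_src": Cell("port"),
    "mux_a": Cell("mux", ["clk_src"], "scan_en"),
    "buf_ab": Cell("buf", ["mux_a"]),
    "mux_b": Cell("mux", ["buf_ab"], "scan_en"),
    "buf_bc": Cell("buf", ["mux_b"]),
    "mux_c": Cell("mux", ["buf_bc"], "scan_en"),
    "mux_f": Cell("mux", ["clk_src"], "func_sel"),   # functional mux, not redundant
}

pairs = redundant_scan_mux_pairs(netlist)
print(pairs)  # [('mux_a', 'mux_b'), ('mux_b', 'mux_c')]
```

Each reported pair is a candidate for review: the downstream mux repeats a scan selection already made upstream, so one of the two may be removable.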