Recursive Hierarchical DFT Methodology With Multi-Level Clock Control and Scan Pattern Retargeting

Dan Trock, Annapurna Labs Ltd., Yokneam, Israel, danit@annapurnalabs.com
Rick Fisette, Mentor Graphics Corp., Boston, MA, rick_fisette@mentor.com

Abstract: The value of hierarchical DFT methodologies is well established today. Generating scan patterns at the core level to determine coverage and debug pattern issues is a common practice in today's large SoCs. However, the pattern mapping process from core to chip level has remained an error-prone task that requires significant verification effort. This leads many SoC designers to regenerate patterns at the SoC level and not fully realize the benefits of a hierarchical approach. Since the introduction of automated pattern mapping by EDA tool vendors, leveraging the hierarchical approach has become easier. In this paper we present a multi-level clock control architecture that enables deployment of recursive hierarchical DFT with scan pattern retargeting at multiple levels. The resulting silicon-proven implementation demonstrates ATPG runtime speed-up and memory savings of 10X. Test coverage is maintained and test pattern count is reduced while DFT tapeout schedule is significantly improved.

DOI:10.3850/9783981537079_0810. © 2016 EDAA.

I. INTRODUCTION

As the design sizes of modern semiconductor devices increase, it becomes more and more difficult to complete all the necessary Design-For-Test (DFT) tasks for tapeout without affecting the schedule. Typical device sizes have grown in terms of flop count from several hundreds of thousands to multi-million. An at-speed ATPG run on a four-million-flop flat design can take 2–3 weeks and require a dedicated server with more than 200GB of memory. By the time the design and the timing closure begin to stabilize, usually only a few weeks remain until tapeout, leaving little time to complete the DFT tasks.

These challenges have required that a scalable methodology such as hierarchical DFT be adopted going forward. This paper will present a hierarchical DFT framework that includes clock control and pattern retargeting at multiple hierarchy levels. Design considerations and implementation details will be reviewed. The paper is organized to cover: (1) previous work and new challenges, (2) a description of DFT features implemented in the case study design, in particular the multi-level clock control, and (3) a presentation of data showing the reductions in ATPG runtime, memory footprint, and tapeout schedule.

II. PREVIOUS WORK

The concept of core-based test approach has been around for more than a decade [1,2]. Various core test access mechanisms (TAMs) have been suggested [3] and basic hierarchical methodology described [4-8]. A core wrapper and a core test language (CTL) have been standardized [9] and a non-intrusive core wrapping technique suggested [10]. Implementations of core-based testing have resulted in lower ATPG runtime and have reduced memory footprint, depending on how many cores are grouped together, but still required that ATPG be run from the top level of the design [5-8]. With the recent advent of built-in pattern retargeting automation within EDA tools [16,17], deploying a fully hierarchical flow has become easier and can offer additional benefits. However, new challenges have arisen. As the shared (non-intrusive) wrapping technique becomes the preferred choice in today's SoCs, the importance of supporting both Internal and External mode testing grows, and this has implications on clock control. In addition, when cores have clocking relationships such as one core clocking another, special care must be taken with regard to clock control, in order to leverage an optimal hierarchical test. This is the focus of the current work and we demonstrate how these challenges have been addressed.

III. CHIP-LEVEL ARCHITECTURE

A. Wrapper Insertion

There are two types of wrapper cells: dedicated and shared (non-intrusive). As described in [6,7,10], shared wrapper cells (which reuse existing functional flops near the PIs and POs) are the preferred wrapping technique, because they have no area and timing overhead. The identification and stitching of shared wrapper cells is now well supported by scan insertion tools.

B. Clock Control and Distribution

How capture clocks are controlled is an important aspect of pattern retargeting. On-chip clock control (OCC) has been in common use for ATPG for some time [4] and has been shown to provide better ATPG results [15]. For pattern retargeting, OCC logic located inside the core is particularly beneficial [14]. However, in External mode, when one core launches and another captures, a synchronized launch-capture pair of clock pulses is necessary for the two cores. This scenario is particularly important for achieving high at-speed test coverage with shared wrappers, as more logic is left outside the wrappers and can only be tested in this mode. Producing such a sync pair is difficult when each core modulates the clock independently with an OCC of its own, since what serves as a trigger to release the clock is typically an asynchronous event. This may result in pairs of pulses that are not properly aligned.

What this paper proposes (and demonstrates with silicon results) is implementation of a hierarchy of clock control at multiple levels, where each core modulates the clock with its own OCC only for the Internal mode of that particular core, while for the parent-level pattern set an upstream parent OCC takes control and feeds a synchronous capture clock to all child cores. During the ATPG process at the parent level, the child-level OCCs only pass the clock through, as it is already modulated by the parent OCC.

Figure 4 illustrates this concept for a single clock domain. The blue nets represent the functional clock trees at the child and parent levels. For Internal mode at the child level the OCC in each core drives the core level clock tree. For External mode, the parent OCC drives the parent-level clock tree, which in turn drives the clocks of each core.

The timing closure is done in functional mode, with all OCCs set to bypass. When testing at the child core level, the OCC of the child core drives the clock tree similarly to functional mode and only a small delta in latency is added, which is common to any interacting flops within the isolated core under test. When the parent hierarchy level is under test, all downstream child core OCCs are set to bypass and the clock tree seen by the active parent-level OCC is identical to functional mode, so the same applies at the parent level. This architecture can be extended to as many levels of hierarchy under test as needed.

Fig. 4 - Child level OCCs for Internal mode, parent-level OCC for External mode

C. Cores with Master-Slave Clocking Relationship

In the typical case, a core receives a free-running clock directly from the SoC PLL for the functional mode operation. However, in some cases there are cores driving clocks of other cores, e.g. serializer-deserializer (SerDes) cores typically used for high-speed communication, thus constituting a master-slave clocking relationship between cores and creating a dependency between the Internal mode pattern sets of these cores.

In such a case, during Internal mode ATPG, a free-running test clock from the SoC PLL can be provided directly to both master and slave cores, as shown in Figure 5 (red highlighted net). This is necessary in order to avoid dependencies and enable efficient retargeting and merging of Internal mode pattern sets.

Fig. 5 - Direct delivery of free-running clock to slave core for Internal mode test.

During External mode ATPG, when the interface between the cores is being tested, the test clock to both master and slave cores is provided by the parent OCC. The internal OCCs of both master and slave cores just let the clock pass through, as it is already modulated. Thus, clocking of the interface between master and slave in External mode is similar to the functional mode, as shown in Figure 6 (red highlighted net).

Fig. 6 - Clocking of interface between master and slave cores in External mode test similar to functional mode.
2016 Design, Automation & Test in Europe Conference & Exhibition (DATE) 1129
D. Choosing Hierarchy Levels for ATPG

In the usual case, two levels of hierarchy for test (core level and top level) are enough for scaling with present day SoCs. However, there are cases when cores are instantiated multiple times in embedded hierarchies, contributing significantly to ATPG runtimes and memory footprint at those hierarchies. One such example is a CPU. A CPU core is normally instantiated multiple times. In this testcase design it has four instances. Because the CPU patterns can be replayed on the four CPU instances at the CPU-cluster level, the ATPG process for just the parent level requires 10x less memory and runtime compared to ATPG for the full cluster netlist with all four CPU instances (results are design-dependent). This is achieved by having a wrapped core inside a wrapped core and applying the flow recursively. For cores like CPU, which can be used in other projects, the benefit of generating patterns at the core level is even greater, as it will be applied to more than one design.

E. Physical Design

There is some additional effort required from the physical design engineer to support the hierarchical methodology. First, the wrapper chains, grouped according to their clock domain and edge, must support shifting at full speed. This is to enable proper launching from the wrapper cell into and out of the core during at-speed test and to achieve maximal test coverage. However, since this should not come at the expense of the functional timing closure, it is not always possible. When this happens, it is not a shift problem that could affect loading and unloading of scan data, rather a capture problem that can be masked within a regular timing exceptions flow, at the expense of some loss of test coverage. Second, the full-chip static timing analysis of DFT modes becomes more complicated with the multi-level clock control due to a larger number of generated clocks at multiple hierarchy levels and different Internal and External modes clock trees.

IV. RESULTS

The SoC design discussed in this paper includes versatile networking and storage interfaces, as well as an advanced compute processors cluster. It consists of 4.3 million flops and 25 million gates. There are 18 cores of 11 different types, including one quartet and four pairs of similar core instances, with input channels connected in broadcast. There are three levels of hierarchical test pattern generation.

Figure 7 illustrates at a high level the major blocks related to the DFT implementation. In this particular implementation, since there are more than 200 chip pins available for scan, we use a distribution test access mechanism [3] that also allows running ATPG in a monolithic flat-design fashion on the full chip. Although such TAM architecture is not optimal in terms of test data volume [11], it did serve us well as a back-up to the hierarchical flow and also for performance comparison to properly determine how much more efficient the hierarchical approach is, as shown below.

Fig. 7 - High-level core layout and DFT channel allocation

Table 1 summarizes the ATPG statistics and includes coverage numbers for the standard stuck-at (DC) and transition delay (AC) pattern types. The AC type ATPG run times and memory footprints are typically much larger than those of the DC, thus only AC data are shown. In the case of Internal mode core statistics, the longest ATPG runtime, highest pattern count, and largest memory footprint across all cores are shown.

TABLE I
FULL CHIP ATPG STATISTICS

FullChip Stat                                        DC cov   AC cov   DC #    AC #    AC runtime   AC Mem
                                                     (%)      (%)      patts   patts   (hr)         (GB)
Largest/longest core-level Internal Mode ATPG run    -        -        22.6K   70K     51           25
Combined Internal Mode coverage
  (before External Mode top-up)                      97.7     90.3     -       -       -            -
External Mode top-up run w/graybox
  (coverage is Internal/External combined)           99.2     91.4     3.4K    5K      7            11
Fullchip ATPG                                        -        89.3             79K     593          235
Reductions resulting from
  Hierarchical implementation                        -        -        -       -       10.2X        9.4X

A. ATPG Runtime

As can be seen from Table 1 and Figure 8, the longest core-level ATPG run for AC patterns was completed in 51 hours. Because the hierarchical methodology allows for all cores to run ATPG independently, they can all be run simultaneously. This means that all cores can be completed within that timeframe. External mode ATPG required another 7 hours for a total of 58 hours to generate test for the entire chip. Compared to the 593 hours for the full-chip ATPG run, the runtime has been reduced by a factor of 10.2X.
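The reduction factors reported in Table I follow from simple arithmetic, reproduced here as a small Python sketch. The numbers are taken from Table I; the function name `hierarchical_totals` and the tuple layout are hypothetical. The sketch assumes, as the text states, that all core-level runs execute in parallel, so wall-clock time is the longest core run plus the External mode top-up, and it further assumes each run has its own server, so the memory requirement is the largest single run.

```python
from typing import List, Tuple

Run = Tuple[float, float]  # (runtime in hours, memory in GB)


def hierarchical_totals(core_runs: List[Run], topup_run: Run) -> Run:
    """Wall-clock hours and peak per-server memory for a hierarchical flow.

    Core-level runs are assumed to execute in parallel, followed by one
    External mode top-up run at the parent level.
    """
    wall_clock = max(hours for hours, _ in core_runs) + topup_run[0]
    peak_mem = max([mem for _, mem in core_runs] + [topup_run[1]])
    return wall_clock, peak_mem


# AC-pattern figures from Table I. Only the largest/longest core is reported,
# which is sufficient because only the maximum matters here.
LONGEST_CORE: Run = (51, 25)
EXTERNAL_TOPUP: Run = (7, 11)
FULL_CHIP: Run = (593, 235)

hours, mem = hierarchical_totals([LONGEST_CORE], EXTERNAL_TOPUP)
runtime_reduction = FULL_CHIP[0] / hours
mem_reduction = FULL_CHIP[1] / mem

print(f"hierarchical: {hours} h, {mem} GB")
print(f"reduction: {runtime_reduction:.1f}X runtime, {mem_reduction:.1f}X memory")
```

With the Table I values this gives 58 hours and 25GB, and ratios of about 10.2X and 9.4X, matching the last row of the table.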

[Bar chart: core ATPG runtimes (hr) and memory (GB) for the individual cores, the External mode run, and the full-chip run; series labeled Internal, External, and Full-chip.]

Fig. 8 - Core-level hierarchical vs. monolithic ATPG runtimes and memory footprint

B. Memory Footprint

The largest core-level ATPG run was able to fit into 25GB of memory. The sizes of grayboxes varied from 0.7% to 20% of the original core size, in terms of the number of design instances. The top-level ATPG with grayboxes consumed 11GB of memory. In comparison, the full-chip ATPG memory consumption was 235GB. This implementation has reduced the memory footprint required for ATPG run by a factor of 9.4X.

V. CONCLUSIONS

Our main goal was to make the load on the DFT engineer more uniform throughout the project and to avoid spikes in workload, especially towards tapeout. With the hierarchical approach and flexibility offered by the multi-level clock control architecture and pattern retargeting, this goal has been achieved very efficiently. The DFT team walked hand in hand with the physical design team and some blocks were put on the shelf long before tapeout, with pattern sets ready, debugged, and verified in SDF-annotated simulations. As the tapeout date approached, most of the DFT work had already been completed at the block level and, for the first time, DFT was not the gating item to tapeout. Although this methodology required some extra design effort early on in the schedule and imposed certain challenges on the physical design and timing closure, overall it saved valuable weeks of work at the most critical time of the project and removed the DFT from the critical path.

REFERENCES

[1] Y. Zorian, E. J. Marinissen, and S. Dey, "Testing Embedded-Core Based System Chips", International Test Conference, pp. 130-143, 1998.
[2] E. J. Marinissen, R. Kapur, and Y. Zorian, "On Using IEEE P1500 SECT for Test Plug-n-Play", International Test Conference, 2000.
[3] S. K. Goel and E. J. Marinissen, "SoC Test Architecture Design for Efficient Utilization of Test Bandwidth", ACM Trans. Design Automation of Electronic Systems, vol. 8, no. 4, pp. 399-429, 2003.
[4] Teresa McLaurin, Frank Frederick, Rich Slobodnik, "The Testability Features of the ARM1026EJ Microprocessor Core", International Test Conference, 2003.
[5] Rick Fisette, Jeffrey Remmers, Moe Villalba, "Hierarchical DFT Methodology – A Case Study", International Test Conference, 2004.
[6] Jeffrey Remmers, Darin Lee, Rick Fisette, "Hierarchical DFT with Enhancements for AC Scan, Test Scheduling and On-chip Compression – A Case Study", International Test Conference, 2005.
[7] Prakash Srinivasan, Ronan Farrell, "Hierarchical DFT with Combinational Scan Compression, Partition Chain and RPCT", IEEE Annual Symposium on VLSI, 2010.
[8] V. R. Devanathan, C. P. Ravikumar, V. Kamakoti, "Reducing SoC Test Time and Test Power in Hierarchical Scan Test: Scan Architecture and Algorithms", International Conference on VLSI Design, 2007.
[9] IEEE Std 1500-2005, "IEEE Standard Testability Method for Embedded Core-based Integrated Circuits".
[10] O. Sinanoglu and T. Petrov, "A Non-intrusive Isolation Approach for Soft Cores", Proc. Design, Automation and Test in Europe Conf. (DATE 07), IEEE CS Press, pp. 27-32, 2007.
[11] O. Sinanoglu, E. J. Marinissen, A. Sehgal, J. Fitzgerald, J. Rearick, "Test Data Volume Comparison: Monolithic vs. Modular SoC Testing", IEEE Design & Test of Computers, vol. 26, no. 3, pp. 25-37, May-June 2009.
[12] Rafal Baranowski, Michael Kochte, Hans-Joachim Wunderlich, "Scan Pattern Retargeting and Merging with Reduced Access Time", IEEE European Test Symposium, 2013.
[13] Martin Keim, Mark Kassab, Rick Fisette, "What's the Difference Between Scan ATPG and IJTAG Pattern Retargeting", Electronic Design, Jan. 22, 2013.
[14] Ron Press, "Design Clock Controllers for Hierarchical Test", EDN, July 18, 2014.
[15] Xijiang Lin, Mark Kassab, "Test Generation for Designs with On-Chip Clock Generators", Asian Test Symposium, 2009.
[16] B. Keller, K. Chakravadhanula, B. Foutz, V. Chickermane, A. Garg, R. Schoonover, J. Sage, D. Pearl, T. Snethen, "Efficient Testing of Hierarchical Core-Based SOCs", International Test Conference, 2014.
[17] Ron Press, "Benefits of Moving to Plug-and-Play Hierarchical DFT", Chip Design, Jan. 21, 2014.
