Test Architecture For Systolic Array of Edge-Based AI Accelerator
ABSTRACT The application diversity and evolution of AI accelerator architectures require innovative DFT
solutions to address issues such as test time, test power, performance and area overhead. Full scan DFT,
because of its enhanced controllability and observability, is an industrial de facto test strategy. However,
it may not yield an optimal test solution with stringent design constraints of edge-based AI accelerators.
In this paper, a novel test architecture based on selective-partial scan is proposed for edge-based systolic
AI accelerators that are constrained in performance, power and area (PPA) overhead. In this architecture,
the structural test patterns are applied partly in a functional manner, which reduces the testability problem
of an array to that of a single processing element (PE), thereby reducing test time and test data volume.
Moreover, a delay fault testing method based on Launch-on-Capture is presented for the proposed partial
scan based architecture. Experimental results show that the proposed architecture is efficient in terms of
test power and test time when compared to full scan DFT.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
96700 VOLUME 9, 2021
U. S. Solangi et al.: Test Architecture for Systolic Array of Edge-Based AI Accelerator
the addition of extra scan MUX logic (per flip-flop) and additional routing (for scan chain stitching), which, in the case of an accelerator, multiplies with increasing array size and may exceed the allowed limits of size and power for an edge-based AI accelerator. A full scan based C-testing approach was proposed in [16], where the testability effort is confined to a single PE. This C-testing approach results in an improvement in test time and a reduction in test patterns. In this paper, we propose a partial scan based DFT architecture with low PPA overhead for edge-based AI hardware. The key contributions of this paper are:
• Investigation of conventional test solutions for systolic arrays: sequential ATPG and full scan are first implemented for a weight-stationary systolic array (based on the TPU model) with FC analysis and associated test overhead.
• A test architecture based on partial scan and systolic pattern loading with built-in checking circuitry is proposed for a weight-stationary systolic array (based on the TPU model).
• Partial broadcasting is proposed for test pattern loading (for test time synchronization) for arrays of different sizes (>16 × 16). The test cost of the proposed test architecture is presented and compared with full scan.
• A delay fault testing method based on Launch-on-Capture is presented for the proposed architecture.
• The proposed method is also evaluated in comparison with the checkerboard based full scan method [16].

The remainder of the paper is organized as follows. In Section 2, various array testing methods are discussed. Section 3 briefly introduces Google’s TPU model that was used for the implementation of this work. In Section 4, we present our analysis of conventional test methods. Section 5 presents the details of the proposed test architecture and its operation. Section 6 gives the details of the proposed solution for at-speed testing with the partial scan based test architecture. In Section 7, the results of the associated experiments are given. Finally, we present the conclusion in Section 8.

II. RELATED WORK
Testing of iterative arrays has been previously studied with C-testability, which is primarily based on functional testing with a constant number of test patterns to test each PE [17], [18]. Friedman [18] presented a theory for modified C-testability based on the function of the processing cell, which detects a single faulty cell of an array. Sung [19] presented sufficient conditions to ensure testability of unilateral and bilateral arrays for detection of a single faulty unit. Elhuni et al. [20] have shown that the test pattern length can be made independent of the size of the array, but this method is limited to one-dimensional iterative arrays. Lombardi [21] has extended the C-testability approach to systolic arrays, provided there are additional patterns to be used for testing the sequential cells (FFs) of a processing unit. These patterns ensure every possible transition as a sufficient condition to test the sequential cells. Moore and Bawa [22] presented a testing method for a bit-level unilateral systolic array, where the length of the test vectors increases with the size of the array. It uses a row comparator for each column for generating the test pass/fail result, thereby compressing the test response for the array. The main limitation of C-testability based functional testing is that it targets the detection of a single faulty cell in the whole array. BIST solutions (based on a single cell fault model) for array multipliers with deterministic (constant) patterns are presented in [23] and [24], in which MUX logic is introduced as a DFT solution to switch between functional and test mode.
Besides, strategies for testing identical cores have been proposed. Giles et al. [25] have addressed the testing of multiple identical cores by providing a scalable parallel test access mechanism (TAM) architecture. In this architecture, the response paths from each core are pipelined through comparators in order to compare the response of each core with that of a core which has already been tested by the ATE. Han et al. [26] proposed a TAM architecture for multiple identical cores that uses majority voting for checking the test response of each core, and the majority response is cross-checked with the ATE response. The key takeaway is that the majority of the cores will match the expected response, and the minority cores with a faulty response can be distinguished through the majority analyzer. A method for concurrent error checking between neighboring elements in a systolic array is presented in [27]. This requires additional XOR logic for output comparisons between neighboring elements and may result in an increased test area overhead.
Ma et al. [28] have tested an AI based SoC by broadcasting test patterns through embedded deterministic test (EDT) to the identical cores to reduce test time. These cores are isolated by IEEE 1500 wrappers and are tested by means of comparators in subsequent test modes. However, this testing approach results in very high routing congestion due to input channel broadcasting, and due to the hardware overhead associated with EDT, it is not an optimum solution for an edge-based AI accelerator. Moreover, this state-of-the-art solution uses a full scan DFT approach, which may not be a suitable solution for a systolic array. The reason is that the circuit connectivity may not allow each FF to provide the same level of controllability and observability, which is the case for most pipeline flow-based accelerators with unidirectional connections. A framework for functional criticality based stuck-at fault analysis for inference applications is presented in [29]. This machine learning based gate-level netlist analysis, performed prior to manufacturing test to target location-specific structural faults, may optimize test generation by specifying the test points and test pattern generation for these critical locations. However, this machine learning based analysis may add to the time-to-market constraint and affect the overall test cost.
Recently, a C-testing approach based on full scan DFT was proposed in [16]. The homogeneity of the PEs is exploited for testing sub-arrays in multiple iterations, which are executed in
checkerboard style. Compared with the industrial EDT approach, this method results in improved test time, and since test patterns are generated at the PE level, the number of patterns is also reduced. This avoids the need for ATPG effort for the full array. Unlike previous work, our proposed architecture enables the concurrent detection of multiple faulty cells, compared to the single fault model of the C-testing approach. The ATPG effort is reduced to the generation of test patterns only for the combinational logic of a PE. Also, it requires less ATE involvement because the response is checked by the built-in test pass/fail logic. In the proposed test architecture, the test patterns are loaded systolically as well as by broadcasting to partial scan chains. The test responses of all PEs are compared to produce the test pass/fail signal for the array. Consequently, in comparison to the full scan test, the test time and test power are significantly reduced; moreover, the area overhead is also reduced.

III. SYSTOLIC ARRAY-BASED MATRIX MULTIPLICATION UNIT
In order to perform real-time inference on streaming data, the accelerator needs to perform frequent data read operations. This read operation is more energy consuming as it needs to access the memory. Therefore, such read operations may not be a suitable choice for an edge-based accelerator due to its limited energy resource. On the other hand, in accelerators with spatial connections, like a systolic array, the connectivity between neighboring cells requires much less energy and low bandwidth [12]–[14]. Google’s TPU is mainly used in clouds/datacenters for inference applications, with a 256 × 256 Matrix Multiply Unit (MMU) as the inference engine, whereas its Edge version, with a smaller array size and power consumption (2 Watts), uses quantized weight bits, e.g., 8 bits [7]. The TPU’s MMU is based on a weight-stationary systolic array to allow reusability of weights in subsequent layers of the DNN. Each PE of the array generates a partial sum, and the partial sums are accumulated at the end of each column in an accumulator.
During the normal operation of an MMU, initially, the pre-trained weights are fed from the weight memory and are systolically shifted across the corresponding PE row of the array. The weight is stored in the weight register of the PE, as depicted in Fig. 1. Subsequently, the activation inputs are fed from the activation memory and are systolically shifted across the corresponding PE column along with the generated partial sum. Each PE performs multiplication between the activation and weight inputs. This product is then added to the partial sum that is generated by the preceding PE to generate the partial sum for the succeeding PE. In a PE, the partial sum is generated by the combinational datapath (multiplier and summation circuitry) and is captured by the partial sum register of the succeeding PE (along the column). Due to the industrial significance of the TPU, we have implemented the proposed DFT architecture and developed the Verilog model of this systolic MMU based on [30]–[33].

IV. CONVENTIONAL TEST SOLUTIONS FOR TPU’S SYSTOLIC ARRAY
To find an optimal test solution for the systolic TPU, first, the gate-level netlist of the Verilog model is subjected to sequential ATPG, as it incurs no DFT hardware overhead. For this, TetraMAX ATPG is used to obtain the FC and the sequential ATPG patterns. It is observed that with an increasing array size, the FC degrades, as shown in Fig. 2a. This happens due to the increasing depth of the sequential path. Since the inference-based classification accuracy of the DNN operation is heavily affected by the degradation of FC, sequential ATPG testing is not a suitable test solution for the systolic array-based accelerator.
For the full scan test, all the FFs are replaced with muxed scan FFs, and the design is synthesized with Synopsys Design Compiler. This conventional approach offers an FC near 100%. However, with full scan testing, the area overhead, test time and test power increase with increasing array size. The DFT area/logic overhead increases mainly due to the increasing amount of scan MUX logic and routing overhead, as shown in Fig. 2b. The test time increases due to the increasing number of scan cells, as shown in Fig. 2c, and the test power increases due to the serial scan shift of test data, as shown in Fig. 2d. This makes the full scan method an unscalable and infeasible approach for testing edge-based AI hardware, which has a smaller
FIGURE 2. Issues with conventional test solutions: (a) FC drop with increasing array size; (b) increase in area overhead for full scan DFT; (c) increasing test time with increasing array size; (d) test power for increasing array size.
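The scan chain partitioning used for the full scan baseline in this section (a single chain up to 8 × 8, then as many chains as needed to keep the 8 × 8 chain length) can be checked with a short calculation. The following is only a minimal sketch: the function name and the figure of 32 flip-flops per PE (8 activation + 8 weight + 16 partial sum bits) are assumptions made for illustration, and the chain count does not depend on that figure as long as all PEs are identical.

```python
def scan_chains_for_equal_length(n, ref_n=8, ffs_per_pe=32):
    """Return the number of scan chains an n x n array needs so that each
    chain is as long as the single chain of the ref_n x ref_n array.

    ffs_per_pe is an assumed value (8 + 8 + 16 register bits per PE).
    """
    if n <= ref_n:
        # Small arrays are stitched into a single scan chain.
        return 1
    ref_chain_length = ref_n * ref_n * ffs_per_pe
    total_ffs = n * n * ffs_per_pe
    chains, remainder = divmod(total_ffs, ref_chain_length)
    if remainder:
        raise ValueError("array size does not split into equal-length chains")
    return chains


if __name__ == "__main__":
    for size in (8, 16, 32):
        print(size, scan_chains_for_equal_length(size))
```

Under these assumptions the sketch gives 4 chains for a 16 × 16 array and 16 chains for a 32 × 32 array, which agrees with the configuration stated in this section.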
physical and power footprint. If the array size is small, e.g., up to 8 × 8, the full scan implementation is done with a single scan chain. For array sizes greater than 8 × 8, e.g., 16 × 16 and 32 × 32, multiple scan chains are synthesized. The number of scan chains is configured to give the 16 × 16 and 32 × 32 arrays the same scan chain length as the 8 × 8 array, i.e., 4 scan chains for the 16 × 16 array and 16 scan chains for the 32 × 32 array. This is done to restrict the number of scan-in and scan-out pins.

V. PROPOSED TEST ARCHITECTURE
Since the array contains identical PEs, it would be an effective approach to confine the test effort to a single PE. To achieve this, the components of a PE are observed separately. A PE consists of sequential cells (activation, weight and partial sum registers) and a combinational datapath. For the synthesized model, in a single PE, the combinational datapath comprises most of the logic (55%) and interconnect (57%); it also contributes over 90% of the computation in a DNN layer. Moreover, stuck-at faults in the datapath severely affect the classification accuracy in inference applications. Because of this structural and computational significance, the datapath circuit is considered exclusively for fault detection. In our model, the datapath consists of an 8-bit multiplier and 16-bit summation circuitry, as shown in Fig. 1. TetraMAX ATPG provided 100% coverage with only 15 test patterns for this datapath circuit (tested separately). Each test pattern is 32 bits wide: 16 bits for the partial sum input, 8 bits for the activation register and 8 bits for the weight register. The test architecture is developed to allow the application of these test patterns to each PE separately yet simultaneously. Constraining testability to a single PE allows a reduction in test data volume, as only 16 test patterns will be used to test an array of any size.
It is depicted in Fig. 1 that the activation and weight registers have pipelined connectivity; thus, these registers only allow applying the test patterns from primary inputs, and they cannot capture any test response from any datapath circuitry. In contrast, the content of the partial sum register of a PE is transferred to the succeeding PE’s partial sum register. This spatial connection enables the capture of the test response of a PE into the succeeding PE. For this reason, only the partial sum register is synthesized with a scan chain to provide the essential observability for the captured response. The structural test patterns for the datapath unit are applied in a functional manner via the activation and weight registers, and the test response is captured by the partial sum register’s scan chain of the succeeding PE in a column. A single column of PEs with the proposed architecture is shown in Fig. 3. The scan input ’SCANIN’ is broadcast to all the PEs of the array to load each scan chain in parallel. The captured response from each scan chain (RC1, RC2, ..., RCN) is shifted to built-in checking circuitry that compares the response bits from each scan chain and generates a pass/fail signal. Since the last PE in a column is connected to the accumulator, to capture the response of the last PE of each column, the partial sum from the last PE is loaded into the ’Response Capture Register’, as shown in Fig. 3.

FIGURE 3. A single column implementation with proposed architecture.

A. TEST PATTERN APPLICATION WITH BUILT-IN CHECKING
It is assumed that the memory unit feeding the test patterns is already tested. We propose to use on-chip memory to store the 15 test patterns that will be loaded into the activation and weight memory. The approach is based on deterministic BIST techniques, where compressed patterns/seeds are stored in the on-chip ROM. For the PEs directly connected to both memories, the activation and weight parts of the datapath’s test patterns are applied in a single test clock cycle. On the successive test clock cycles, the same test pattern will be
FIGURE 4. Number of clocks required for parallel and serial loading of a single test pattern in a 16 × 16 array.
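The built-in checking described in this subsection can be pictured as a bitwise comparison of the serial scan-out streams of the PEs. The sketch below is illustrative only: the text states that the scan-out bits feed AND and OR gates, while the way the two gate outputs are combined here (flag = OR and not AND), and the function names, are assumptions rather than the authors' circuit.

```python
def mismatch_flag(bits):
    """Flag for one shift cycle: 1 if the scan-out bits of the PEs differ."""
    all_ones = all(b == 1 for b in bits)  # models the AND gate
    any_one = any(b == 1 for b in bits)   # models the OR gate
    # Assumed combination: bits disagree when OR is 1 but AND is 0.
    return int(any_one and not all_ones)


def check_responses(responses):
    """Compare equal-length response streams, one list of bits per PE.

    Returns (fail, cycle), where cycle is the first shift cycle with a
    mismatch, or None when every cycle matches.
    """
    for cycle, bits in enumerate(zip(*responses)):
        if mismatch_flag(bits):
            return True, cycle
    return False, None


if __name__ == "__main__":
    good = [0, 1, 1, 0]
    faulty = [0, 1, 0, 0]          # one flipped bit in the third cycle
    print(check_responses([good, good, good]))      # no mismatch
    print(check_responses([good, faulty, good]))    # mismatch detected
    print(check_responses([faulty, faulty, faulty]))  # identical faults escape
```

The last call illustrates the limitation noted in the text: the flag is silent when all PEs produce the same faulty response.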
loaded into the rest of the PEs by systolic shifting. To do that, the enable signal of the weight register FFs is ORed with the test mode signal to allow weight shifting during the test mode. For the partial sum portion of the datapath input pattern, the test pattern is serially shifted in over multiple test clock cycles (equal to the number of partial sum FFs in a single PE) for each PE through the shared scan-in pin. Fig. 4 shows the loading of the test patterns for a 16 × 16 array and the number of clock cycles required for loading the input pattern into the registers.
During the functional mode, while capturing the test response of a datapath into the succeeding PE, the activation and weight inputs are held constant in the memory (activation and weight) units. Again, during the test mode, the captured test response is unloaded serially while the next test pattern is simultaneously shifted in through the partial sum scan chain and the activation and weight inputs are applied systolically. If, during the test response unloading, any bitwise mismatch between the test responses of the various PEs occurs, it is registered as a Test FAIL flag by the built-in checking circuitry and the testing is ended.
Since in the proposed technique the test patterns are applied in a functional manner, the test response to a test pattern must be the same for all the PEs if there is no stuck-at fault. However, the presence of a stuck-at fault in the datapath unit or the registers will result in a datapath response that does not match the response of the other, non-faulty PEs in a column. The proposed architecture has integral built-in checking circuitry that acts as a comparator, as shown in Fig. 3, and is made of combinational logic. In the built-in checking circuitry, the serial scan-out of each PE is shared between the inputs of an AND gate and an OR gate, which can detect any single-bit mismatch among any number of output response streams from the partial sum scan chains by raising the flag to ’1’. The flag gives a false indication only when all the PEs have the same faulty response. All the response flags from the built-in checking circuits are ORed for the detection of faults in the whole array.
Moreover, successive pattern application with systolic dataflow ensures that each FF of the activation and weight registers goes through all possible transitions, ensuring the testability conditions for the FFs. So, if there is a stuck-at fault in any FF of the registers (activation, weight, partial sum) in a PE, it would result in a different input pattern being applied to the datapath circuit, and its response will cause a mismatch at the built-in checking circuitry when compared with the other PE responses. The FFs of the partial sum register are tested by applying the scan chain pattern (..001100..) at the start of serial shifting.
In addition to structural testing, the proposed architecture enables functional testing of the PEs, where functional patterns can be loaded systolically into the activation and weight registers through the memory units. Subsequently, each datapath’s functional response is captured by the partial sum scan chain and shifted out concurrently from each PE. These responses can be compared and checked by the built-in checking circuit.

B. PARTIAL BROADCASTING OF INPUT PATTERNS
For array sizes of more than 16 × 16 PEs, partial broadcast is proposed, in which the array is divided into multiple blocks of 16 × 16 PEs. Instead of broadcasting the test patterns to all the PEs of the array, broadcasting is done to the first PE of each block. Each block will require exactly 16 cycles to load the test patterns into the partial sum scan chains (serially) and into the 16 activation and weight registers (systolically). Since broadcasting allows the loading/unloading of patterns for each block in 16 cycles, the whole array requires only 16 cycles to load the pattern. Hence, this broadcast technique improves the testing time of the whole array.
This broadcasting can also be used for array sizes which are not a multiple of 16. For example, a 24 × 24 array can be divided into 4 blocks of 12 × 12, as shown in Fig. 5. The only restriction is that a block cannot be greater than 16 × 16 as the
pattern loading requires exactly 16 cycles. In that case, the first block is always made 16 × 16. This broadcasting is enabled only during test mode. This way the array size is not a factor that affects the overall test time, and the number of test cycles (and eventually the test time) is determined by the size of the partial sum register. Furthermore, this broadcasting of input test patterns allows the proposed architecture to scale with increasing array size. Fig. 5 shows an example of a partial broadcast implementation, where MUX logic is inserted at the boundaries of a 16 × 16 block to allow input pattern sharing to the first PE of the other block directly from the memory units.

VI. AT-SPEED TESTING WITH PROPOSED ARCHITECTURE
To enable at-speed testing of the TPU systolic array, vector pairs for launching the transitions are extracted for the datapath combinational logic (as done for stuck-at faults). As the activation and weight registers are non-scan flip-flops, the vectors are applied at-speed from the memory in the systolic pipelined manner. The application of vectors is depicted in Fig. 6. First, the vector V1 is loaded into the activation and weight registers systolically, while the partial sum scan chain input is concurrently loaded (serially) in 16 clock cycles. After the vector V1 is loaded into each PE, test enable is deactivated to allow at-speed launching of the transition with vector V2. It may be noted that the vector V2 for the activation and weight registers is loaded from the memory units, while the transition vector for at-speed testing of the summation circuitry is launched (generated functionally) from the datapath logic of the previous PE (vertically connected). With systolic loading, the activation and weight registers receive the same transition launching vector V2, while the transition launching vector for the summation circuitry comes from the vertically preceding adjacent PE (for example, from PE11 to PE21 and from PE22 to PE32).
Based on the proposed test architecture, an implementation for a 3 × 3 array is presented in Fig. 6. The launch vector V2 is timed to match the systolic loading into adjacent PEs. For example, at the first clock cycle the vector V2 (activation and weight) is loaded into PE11 only, and on the 2nd cycle vector V2 is loaded systolically into PE12 (activation) and PE21 (weight) from PE11, while the remaining part of V2 is loaded from the memory into PE12 (weight) and PE21 (activation) on the 2nd cycle. The 2nd cycle will capture the response from PE11 into the partial sum register of PE21 (vertically adjacent), and this captured response will launch a transition for the summation circuitry of PE21, as shown in the timing diagram in Fig. 7. A series of launch/capture at-speed clocks (in addition to the shift-in/shift-out clocks for loading V1) is applied to test the whole array for a single vector pair (V1:V2). The number of these at-speed clocks depends on the size of the array, i.e., for an n × n array, n + n at-speed clocks are required for a single transition pattern, in addition to 16 clock cycles for scanning in and scanning out at the scan shift frequency.
For a 3 × 3 array, after 6 cycles, the response from each column is collected at the response capture registers. But each column’s response is captured at a different cycle. For column 1, since there are 3 PEs, the 4th cycle will capture the response. For column 2, the 5th cycle captures the response, and the 6th cycle will capture the response for column 3. Each response capture register is clocked at these specified clock cycles. This is enabled by the clock control circuitry shown in Fig. 8. This control is required to prevent any fault from being masked (any faulty response value being changed back to the correct one), which can
FIGURE 6. Vector loading into a 3 × 3 systolic array for transition delay testing.
FIGURE 7. Timing diagram with Launch, Capture, Scanin and Scanout operations for a 3 × 3 array.
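The at-speed clocking described in this section can be summarized numerically. The sketch below is only an illustration: the clock count n + n and the 3 × 3 capture cycles (4th, 5th and 6th) come from the text, whereas the general rule "column c is captured at cycle n + c" is an extrapolation from that single example, and the function names are hypothetical.

```python
def at_speed_clocks(n):
    """At-speed launch/capture clocks for one vector pair on an n x n array."""
    return n + n


def capture_cycle(n, column):
    """Cycle at which the response capture register of a column is clocked.

    column is 1-indexed. The rule n + column is an assumed generalization
    of the 3 x 3 example (columns 1, 2, 3 captured at cycles 4, 5, 6).
    """
    if not 1 <= column <= n:
        raise ValueError("column must be in 1..n")
    return n + column


def capture_schedule(n):
    """Map each column to its capture cycle; the last one equals n + n."""
    return {col: capture_cycle(n, col) for col in range(1, n + 1)}


if __name__ == "__main__":
    print(at_speed_clocks(3))
    print(capture_schedule(3))
```

Locking each column at its own cycle is what the clock control circuitry is described as providing, so that later capture cycles do not overwrite a faulty response.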
happen due to multiple capture cycles. For example, if a fault occurs in the response from PE11 in the 2nd cycle, the 3rd cycle can mask this fault in the PE21 summation circuitry. The 5th and 6th cycles may mask this fault for the whole column. So, to stop masking of this fault during response capture, the response of column 1 is locked at the 4th cycle. As this fault changes the launch vector for the (vertically) adjacent PE’s summation circuit (in column 1), a different transition vector is propagated through that column (1) as compared to the other columns (2 and 3), resulting in a different response. The partial sum registers of PEs other than the first PE in each column are connected to the partial sum response of the vertically adjacent PEs. This will result in the partial sum registers launching a different transition vector value through that column. However, the launch vector from the partial sum registers of horizontally adjacent PEs is the same. Hence, the built-in checking is done among the column responses to detect this mismatch in responses, unlike the proposed scan testing method (in Fig. 3), where the response is compared among all PEs. Here the column response is the response from the last PE in that column. After the 6th cycle, the response from each response capture register is serially unloaded concurrently into the built-in checking circuit, which is a separate unit from the one that is used for stuck-at fault testing. Like stuck-at fault response checking, the built-in checking circuit will detect any mismatch among the column responses. A signal ’stuck/(tran)’ (with an overline on ’tran’) switches
between stuck-at fault testing and transition testing mode. This signal also switches between the test pass/fail logic unit for stuck-at fault testing and the test pass/fail logic unit for transition fault testing. The overhead of the response capturing unit is 1.6% for a 3 × 3 array, and since this unit is implemented per column of the whole array, its overhead decreases with the array size. This is because the response capture unit grows linearly, unlike the quadratic growth in logic of the whole array.

VII. EXPERIMENTS AND RESULTS
A. AREA OVERHEAD
The systolic TPU model for various array sizes is synthesized with Design Compiler on the SAED 32nm library. With the proposed synthesized design there is a DFT logic area overhead reduction of approx. 11% in sequential area, 9% in interconnect area and around 4% in total area compared to full scan DFT, as given in TABLE 1. It is evident that the total area overhead is only marginally improved as compared to the sequential and routing area overhead. This is because most of the total area is taken up by the combinational logic of the datapath circuit. The area overhead of the test pass/fail logic for a 32 × 32 array is 0.1%.

TABLE 1. Reduction in area overhead.

B. PERFORMANCE OVERHEAD
Full scan FFs always introduce a performance penalty to the original circuit due to additional MUX logic and fanout in the critical path. The synthesized proposed design has no scan chain fanout for the activation and weight register flip-flops. This represents an average 26.44% reduction in fanout capacitance compared to the full scan flip-flops. This results in reduced delay and less dynamic power (αCV²f) for the sequential cells (FFs). Moreover, there is no additional propagation delay of scan MUX logic in the activation and weight registers.

C. TEST TIME
To evaluate the test time for arrays with the proposed design, gate-level netlist simulations were performed on ModelSim with a test frequency of 10 MHz. In addition to the 15 test patterns to test the datapath, a scan chain test pattern (· · · 00110 · · ·) is included to test the partial sum scan chain. With these gate-level netlist simulations, the test time improvement of the proposed design is compared to the full scan design in TABLE 2. As the test application time for the whole array (in the proposed architecture) is matched with the test time of a single PE, it remains the same for any array size, and there is a growing improvement in the test time with increasing array size. For an array with fewer than 16 PEs, 16 clock pulses will still be applied to the activation and weight registers to synchronize with the 16 clock pulses for the shift-in and shift-out operation of each partial sum scan chain. For this reason, the test pattern data from memory is held constant for 17 cycles (16 for shift-in and 1 for capture). The test time for multiple scan chain-based arrays (16 × 16 and 32 × 32) with the same scan chain length does not improve due to the increased number of patterns, which is due to the relative increase in combinational logic.
TABLE 3 implements the full scan based array testing proposed in [34], which mainly considers time-to-market as the main constraint and proposes to use broadcasting of the test patterns to test multiple identical modules simultaneously. The proposed architecture maintains its advantage over full scan DFT with various array module implementations, where the whole array (full scan) is divided into smaller sub-modules for pattern broadcasting/sharing.

D. TEST POWER
With full scan DFT, the shift power of serial scan shifting of test patterns depends on the length of the scan chain and the size of the combinational logic. This results in an increase in shift power with increasing array size. For the proposed architecture, value change dump (VCD) files were generated by gate-level netlist simulation for obtaining the event-driven test power of the whole array with Synopsys PrimePower. As the proposed architecture uses a smaller number of patterns with the serial scan chain length limited to 16 for each PE, there is an improvement in serial shift power compared to full scan DFT, as shown in TABLE 2 and TABLE 3. In full scan DFT, a single pattern may cause multiple transitions during serial shift operations at a single scan FF (of the activation and weight registers), whereas in the proposed architecture, loading a single pattern causes at most a single transition at a non-scan FF (activation and weight registers). Also, the partial scan chain is connected to the summation circuitry of the datapath, whereas the full scan chain is connected to the whole datapath unit (summation and multiplier circuitry). This results in lower dynamic power consumption in the combinational logic during the scan shift operation in the proposed architecture. This reduction in the number of transitions, combined with the smaller scan chain per PE and lower scan shift power per pattern, causes a proportional reduction in the overall test power.
Moreover, with a limited power footprint, edge-based AI devices are more vulnerable to peak test power (the maximum power consumed at any single test clock cycle), as it may cause reliability issues (like hotspots). Since the proposed architecture uses partial scan, the number of scan FFs capturing the response at a single test clock cycle and the number of transitions (during serial shift) occurring at any test clock cycle are reduced. Both factors contribute to a reduction in peak power. This reduction in peak test power improves the
TABLE 2. Results for the test time, power and peak power of the proposed architecture against test pin constrained f.scan.
TABLE 3. Results for test time, power and peak power of the proposed architecture against test time constrained f.scan [34].
reliability of the hardware. For the 16 × 16 and 32 × 32 full scan arrays, multiple scan chains do not alleviate test power and peak power consumption, mainly because of the additional scan chains (compared to 8 × 8) and the increased power consumption in the combinational logic. For the proposed partial scan-based arrays, in contrast, there is an improvement over their full scan counterparts. The total power overhead of the test pass/fail logic for the whole duration of pattern application in our proposed architecture for a 32 × 32 array is 4.6%.

E. TEST POWER AND TEST TIME FOR AT-SPEED TESTING
For at-speed testing, the proposed architecture is simulated on ModelSim with its custom flow for transition patterns (vector pairs). From TetraMAX ATPG, 16 transition vector pairs were generated for the datapath combinational logic. The test power for the customized pattern flow for various array sizes is estimated with PrimePower from the associated VCD file and is given in Table 4. In the full scan based arrays, one scan chain per PE is synthesized to restrict the delay-based testing to a single PE, as done for the proposed architecture. The full scan delay testing is done with the LoC method. The number of patterns increases with the array size, as shown in Table 4. This increase in the number of patterns with increasing array size results in increasing test power and test time for full scan. Since the number of patterns is fixed at 16 with the proposed partial scan based architecture, the test time improvement grows with the array size, while the advantage in shift power is maintained, as (per PE) only half of the scan elements are shifting the patterns when compared to full scan DFT.

F. CHECKERBOARD FULL SCAN TEST METHOD
The proposed partial scan method is also evaluated against the checkerboard method [16]. A TPU model with a 32-bit partial sum register is considered for the proposed partial scan method because the checkerboard method uses a 32-bit partial sum register. ATPG was performed for stuck-at and
TABLE 4. Results for test time and test power for at-speed testing against f.scan LoC test.
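The scaling argument of Section E, a fixed set of 16 transition vector pairs versus a pattern count that grows with array size, can be sketched with a textbook cycle count for Launch-on-Capture testing. This is only an illustration under stated assumptions: the cycle formula (shift-in overlapped with the previous shift-out, plus one launch and one capture cycle per pattern), the chain lengths, and the full scan pattern counts are hypothetical and are not taken from Table 4.

```python
PROPOSED_PATTERNS = 16  # fixed number of transition vector pairs stated in the text


def loc_cycles(num_patterns, chain_len):
    """Clock cycles for an LoC test: per pattern, `chain_len` shift cycles
    (overlapped with unloading the previous response) plus launch and capture,
    followed by one final unload of `chain_len` cycles."""
    return num_patterns * (chain_len + 2) + chain_len


def improvement(full_patterns, full_chain_len, partial_chain_len):
    """Fractional test-cycle reduction of the fixed-pattern partial scan case."""
    full = loc_cycles(full_patterns, full_chain_len)
    proposed = loc_cycles(PROPOSED_PATTERNS, partial_chain_len)
    return 1 - proposed / full


# Hypothetical full scan pattern counts that grow with array size.
assumed_full_patterns = {"8x8": 20, "16x16": 40, "32x32": 80}
gains = {size: improvement(p, full_chain_len=32, partial_chain_len=16)
         for size, p in assumed_full_patterns.items()}
for size, gain in gains.items():
    print(f"{size}: {gain:.1%} fewer test cycles")
```

With the proposed pattern count held constant, the relative saving in this model increases whenever the full scan pattern count increases, which mirrors the trend described for larger arrays.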
transition faults for this 32-bit model. Since the ATPG effort is limited to the datapath logic, the number of patterns is lower than for the checkerboard method. Also, because the PE has only the partial sum register as a scan register, the number of test cycles is smaller. The proposed partial broadcasting method allows test time improvement for arrays larger than 32 × 32 (as the scan chain length is now 32), as mentioned in Section V-B. The results are shown in Table 5.

VIII. CONCLUSION
Regardless of the application area of an electronic system, test cost/overhead presents a major design problem due to its implications on the overall system cost and operation. Implementing de facto test techniques such as full scan DFT may not yield a cost-effective solution for overhead-constrained edge computing devices. In this paper, an efficient and scalable test solution is proposed for the weight-stationary systolic array of edge-based AI hardware. The proposed architecture addresses the testability of the whole array at the PE level. This architecture-specific solution leads to an efficient testing approach. Because the test time and test power improvements grow with array size, this architecture is also well suited for large-scale accelerators in clouds and datacenters.

REFERENCES
[1] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, Oct. 2016, doi: 10.1109/JIOT.2016.2579198.
[2] W. Z. Khan, E. Ahmed, S. Hakak, I. Yaqoob, and A. Ahmed, “Edge computing: A survey,” Future Gener. Comput. Syst., vol. 97, pp. 219–235, Aug. 2019, doi: 10.1016/j.future.2019.02.050.
[3] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017, doi: 10.1109/JPROC.2017.2761740.
[4] F. Wang, M. Zhang, X. Wang, X. Ma, and J. Liu, “Deep learning for edge computing applications: A state-of-the-art survey,” IEEE Access, vol. 8, pp. 58322–58336, 2020, doi: 10.1109/ACCESS.2020.2982411.
[5] J. Chen and X. Ran, “Deep learning with edge computing: A review,” Proc. IEEE, vol. 107, no. 8, pp. 1655–1674, Aug. 2019, doi: 10.1109/JPROC.2019.2921977.
[6] NVIDIA. (2019). JETSON TX2 High Performance AI at the Edge. [Online]. Available: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2/
[7] Google. (2019). Edge TPU - Run Inference at Edge. [Online]. Available: https://cloud.google.com/edgetpu/
[8] P. J. Bannon, “Accelerated Mathematical Engine,” U.S. Patent 0 026 078 A1, Sep. 20, 2017.
[9] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, N. Boden, A. Borchers, and R. Boyle, “In-datacenter performance analysis of a tensor processing unit,” in Proc. 44th Annu. Int. Symp. Comput. Archit., Toronto, ON, Canada, Jun. 2017, pp. 1–12, doi: 10.1145/3079856.3080246.
[10] Y.-H. Chen, T.-J. Yang, J. S. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019, doi: 10.1109/JETCAS.2019.2910232.
[11] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proc. ASPLOS, 2014, pp. 269–284.
[12] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, “A massively parallel coprocessor for convolutional neural networks,” in Proc. 20th IEEE Int. Conf. Appl.-Specific Syst., Archit. Processors, Boston, MA, USA, Jul. 2009, pp. 53–60, doi: 10.1109/ASAP.2009.25.
[13] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynamically configurable coprocessor for convolutional neural networks,” in Proc. 37th Annu. Int. Symp. Comput. Archit. (ISCA), 2010, pp. 247–257.
[14] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A convolutional network accelerator,” in Proc. 25th Ed. Great Lakes Symp. (VLSI), 2015, pp. 199–204.
[15] J. J. Zhang, T. Gu, K. Basu, and S. Garg, “Analyzing and mitigating the impact of permanent faults on a systolic array based neural network accelerator,” in Proc. IEEE 36th VLSI Test Symp. (VTS), San Francisco, CA, USA, Apr. 2018, pp. 1–6, doi: 10.1109/VTS.2018.8368656.
[16] A. Chaudhuri, C. Liu, X. Fan, and K. Chakrabarty, “C-testing of AI accelerators∗,” in Proc. IEEE 29th Asian Test Symp. (ATS), Nov. 2020, pp. 1–6, doi: 10.1109/ATS49688.2020.9301581.
[17] W. H. Kautz, “Testing for faults in combinational cellular logic arrays,” in Proc. 8th Annu. Symp. Switching Automata Theory (SWAT), Austin, TX, USA, 1967, pp. 161–174, doi: 10.1109/FOCS.1967.33.
[18] A. D. Friedman, “Easily testable iterative systems,” IEEE Trans. Comput., vol. C-22, no. 12, pp. 1061–1064, Dec. 1973, doi: 10.1109/T-C.1973.223651.
[19] C.-H. Sung, “Testable sequential cellular arrays,” IEEE Trans. Comput., vol. C-25, no. 1, pp. 11–18, Jan. 1976, doi: 10.1109/TC.1976.5009199.
[20] H. Elhuni, A. Vergis, and L. Kinney, “C-testability of two-dimensional iterative arrays,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. CAD-5, no. 4, pp. 573–581, Oct. 1986, doi: 10.1109/TCAD.1986.1270228.
[21] F. Lombardi, “On a new class of C-testable systolic arrays,” Integration, vol. 8, pp. 269–283, Dec. 1989, doi: 10.1016/0167-9260(89)90020-5.
[22] W. R. Moore and V. Bawa, “Testability of a VLSI systolic array,” in Proc. 11th Eur. Solid-State Circuits Conf. (ESSCIRC), Toulouse, France, Sep. 1985, pp. 271–276, doi: 10.1109/ESSCIRC.1985.5468108.
[23] S.-K. Lu, J.-C. Wang, and C.-W. Wu, “C-testable design techniques for iterative logic arrays,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 3, no. 1, pp. 146–152, Mar. 1995, doi: 10.1109/92.365462.
[24] D. Gizopoulos, A. Paschalis, and Y. Zorian, “An effective built-in self-test scheme for parallel multipliers,” IEEE Trans. Comput., vol. 48, no. 9, pp. 936–950, Sep. 1999, doi: 10.1109/12.795222.
[25] G. Giles, J. Wang, A. Sehgal, K. J. Balakrishnan, and J. Wingfield, “Test access mechanism for multiple identical cores,” in Proc. Int. Test Conf., Austin, TX, USA, Nov. 2009, pp. 1–10, doi: 10.1109/TEST.2009.5355560.
[26] T. Han, I. Choi, and S. Kang, “Majority-based test access mechanism for parallel testing of multiple identical cores,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 8, pp. 1439–1447, Aug. 2015, doi: 10.1109/TVLSI.2014.2341674.
[27] M. Cheong, I. Lee, and S. Kang, “A test methodology for neural computing unit,” in Proc. Int. SoC Design Conf. (ISOCC), Daegu, South Korea, Nov. 2018, pp. 11–12, doi: 10.1109/ISOCC.2018.8649896.
[28] H. Ma, R. Guo, Q. Jing, J. Han, Y. Huang, R. Singhal, W. Yang, X. Wen, and F. Meng, “A case study of testing strategy for AI SoC,” in Proc. IEEE Int. Test Conf. Asia (ITC-Asia), Tokyo, Japan, Sep. 2019, pp. 61–66, doi: 10.1109/ITC-Asia.2019.00024.
[29] A. Chaudhuri, J. Talukdar, F. Su, and K. Chakrabarty, “Functional criticality classification of structural faults in AI accelerators,” in Proc. IEEE Int. Test Conf. (ITC), Nov. 2020, pp. 1–5, doi: 10.1109/ITC44778.2020.9325272.
[30] J. Ross, N. Jouppi, A. Phelps, R. Young, T. Norrie, G. Thorson, and D. Luu, “Neural network processor,” U.S. Patent 9 747 546 B2, May 21, 2015.
[31] J. Ross and A. Phelps, “Computing convolutions using a neural network processor,” U.S. Patent 9 697 463 B2, May 21, 2015.
[32] J. Ross, “Prefetching weights for use in a neural network processor,” U.S. Patent 9 805 304 B2, May 21, 2015.
[33] J. Ross and G. Thorson, “Rotating data for neural network computations,” U.S. Patent 9 747 548 B2, May 2015.
[34] R. Singhal, “AI chip DFT techniques for aggressive time-to-market,” Mentor, Siemens Bus., White Paper, 2019.

UMAIR SAEED SOLANGI received the bachelor’s degree in electronic engineering and the master’s degree in embedded systems from Mehran University, Pakistan. He is currently doing Ph.D. research in the field of design for testability with Hanyang University, ERICA, South Korea. He is also an Assistant Professor with the Department of Electronic Engineering (at a public sector university), Pakistan. Other research interests include embedded systems, low power design, and digital logic design. He was a recipient of the Ph.D. Scholarship by the Higher Education Commission, Pakistan.

MUHAMMAD IBTESAM received the B.Sc. degree in electrical engineering from the University of Engineering and Technology at Taxila, Taxila, Pakistan. He is currently pursuing the combined M.S. and Ph.D. degree in computer science and engineering with Hanyang University. His research interests include the design for testability (DFT), low power 3-D IC/SiP testing and low power TAM designs for AI accelerators. He was a recipient of M.S.-Ph.D. Scholarship by the Higher Education Commission, Pakistan.

MUHAMMAD ADIL ANSARI received the B.E. degree in electronic engineering from the Mehran University of Engineering and Technology (UET), Pakistan, in 2006, and the M.S. and Ph.D. degrees in computer science and engineering from Hanyang University, South Korea, in 2010 and 2016, respectively. He worked as an Operations Engineer with Pakistan Telecommunication Company Ltd., from 2006 to 2008, and served as a Lecturer for the COMSATS Institute of Information Technology, Pakistan, from 2010 to 2011. He is currently with Quaid-e-Awam University, Pakistan, as an Assistant Professor, from 2011 to 2018, where he has been an Associate Professor, since 2018. His research interests include design-for-testability of digital stacked and non-stacked integrated circuits.

JINUK KIM received the B.S. degree in computer science and engineering from Hanyang University, South Korea, in 2015, where he is currently pursuing the combined M.S. and Ph.D. degree in computer science and engineering. His research interests include design-for-testability (DFT), memory ECC, memory test, and 3-D IC/SiP (system-in-package) testing.

SUNGJU PARK (Senior Member, IEEE) received the B.S. degree in electronic engineering from Hanyang University, South Korea, in 1983, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Massachusetts, USA, in 1988 and 1992, respectively. From 1983 to 1986, he was with Gold Star Company, South Korea. From 1992 to 1995, he served for IBM Microelectronics, Endicott, NY, USA, as a Development Staff, in-charge of boundary scan and LSSD scan design. Since 1995, he has been a Professor with the Department of Computer Science and Engineering, Hanyang University. His research interests include the area of VLSI testing, including scan design, built-in self-test, test pattern generation, fault simulation, and synthesis of test. Additional interests include graph theory and design verification. He is a member of the Institute of Electronics Engineers of Korea, the Korea Information Science Society, and the Institute of Electronics and Information and Communication Engineers.