Test Architecture For Systolic Array of Edge-Based AI Accelerator

This document proposes and evaluates a test architecture for edge-based artificial intelligence (AI) accelerators implemented as systolic arrays. It begins with background on edge AI hardware and discusses limitations of existing full-scan design-for-testability (DFT) approaches. The key contributions are a novel partial-scan test architecture for systolic arrays with reduced test overhead, and a delay fault testing method using Launch-on-Capture. Experimental results show the proposed architecture has lower test power and time compared to full-scan DFT.


Received May 30, 2021, accepted June 21, 2021, date of publication July 5, 2021, date of current version July 14, 2021.

Digital Object Identifier 10.1109/ACCESS.2021.3094741

Test Architecture for Systolic Array of Edge-Based AI Accelerator

UMAIR SAEED SOLANGI 1,2, MUHAMMAD IBTESAM 1, MUHAMMAD ADIL ANSARI 2, JINUK KIM 1, AND SUNGJU PARK 1, (Senior Member, IEEE)

1 Department of Computer Science and Engineering, Hanyang University, Seoul 04763, South Korea
2 Department of Electronic Engineering, Quaid-e-Awam University of Engineering Science and Technology, Larkana 77150, Pakistan
Corresponding author: Sungju Park (paksj@hanyang.ac.kr)
This work was supported in part by the Higher Education Commission, Government of Pakistan, under the Scholarship Program Faculty
Development of University of Engineering Science and Technology Pakistan/University of Engineering and Technology (UESTPs/UETs),
in part by the BK21 FOUR (Fostering Outstanding Universities for Research) funded by the Ministry of Education (MOE), South Korea,
and in part by the National Research Foundation of Korea (NRF).

ABSTRACT The application diversity and evolution of AI accelerator architectures require innovative DFT
solutions to address issues such as test time, test power, performance and area overhead. Full scan DFT,
because of its enhanced controllability and observability, is an industrial de facto test strategy. However,
it may not yield an optimal test solution with stringent design constraints of edge-based AI accelerators.
In this paper, a novel test architecture based on selective-partial scan is proposed for edge-based systolic AI
accelerators constrained in performance, power and area (PPA) overhead. In this architecture, the structural
test patterns are applied partly in a functional manner, which reduces the testability problem of an array to
that of a single processing element (PE), thus reducing test time and test data volume. Moreover,
a delay fault testing method based on Launch-on-Capture is presented for the proposed partial-scan-based
architecture. Experimental results show that the proposed architecture is efficient in terms of test power and test
time when compared to full scan DFT.

INDEX TERMS Design for testability, systolic arrays, TAM, testing.

The associate editor coordinating the review of this manuscript and approving it for publication was Luca Cassano.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

VOLUME 9, 2021

I. INTRODUCTION

Currently, most artificial intelligence (AI) applications run on clouds/datacenters. However, with an enormous amount of data being produced by consumers, there is a growing need for edge AI accelerators. Edge computing offers a cost-effective and low data bandwidth solution by bringing data processing local to the source of data, with improvement in the response time [1], [2]. The recent AI resurgence has been due to deep neural networks (DNNs), which process more hidden layers and result in increased classification accuracy [3]. Moreover, there is a growing interest in the convergence of DNN processing capabilities with edge computing devices to enhance the application paradigm [4], [5]. NVIDIA, Google and Tesla have already introduced specialized accelerators for edge inference applications [6]–[8] with smaller physical and power footprint.

Several DNN hardware accelerators are being developed for inference tasks on application specific integrated circuits (ASIC) [9]–[11]. ASIC based AI hardware accelerators usually favor spatial dataflow architectures, which enable transfer of data between neighboring processing elements (PEs). This pipelined dataflow avoids the need for frequent memory read operations, which results in energy optimization [10]. Variants of the weight-stationary systolic array are used to accelerate CNN inference with low power consumption in [12]–[14]. Essentially, a weight-stationary systolic array allows reuse of weights in implementing subsequent layers of a DNN. This architecture has also been adopted by Google Inc. for their industrial Tensor Processing Unit (TPU) [9] due to its low bandwidth feature.

A recent study has shown that the error resilience of AI is insufficient to overcome the effects of stuck-at faults in a weight-stationary systolic array: as few as 0.005% faulty PEs can degrade the classification accuracy by up to 74.13% [15]. The reason for such a drop in accuracy is that the stuck-at faults frequently affect the higher order bits of the MAC output. This shows that, in addition to yield enhancement, the fault coverage (FC) is also crucial for reliable DNN operation. Moreover, edge-based AI hardware requires a small physical and power footprint. The main limiting factor for the scalability of the full scan DFT approach in terms of test overhead is the addition of extra scan MUX logic (per flip-flop) and additional routing (for scan chain stitching). In an accelerator, this overhead multiplies with increasing array size and may exceed the allowed limits of size and power for an edge-based AI accelerator. A full scan based C-testing approach was proposed in [16], where the testability effort is confined to a single PE. This C-testing approach yields an improvement in test time and a reduction in test patterns. In this paper, we propose a partial scan based DFT architecture with low overhead (PPA) for edge-based AI hardware. The key contributions of this paper are:

• Investigation of conventional test solutions for the systolic array; sequential ATPG and full scan are first implemented for a weight-stationary systolic array (based on the TPU model) with FC analysis and associated test overhead.
• A test architecture based on partial scan and systolic pattern loading with built-in checking circuitry is proposed for a weight-stationary systolic array (based on the TPU model).
• Partial broadcasting is proposed for test pattern loading (for test time synchronization) for arrays of different sizes (>16 × 16). The test cost of the proposed test architecture is presented and compared with full scan.
• A delay fault testing method based on Launch-on-Capture is presented for the proposed architecture.
• The proposed method is also evaluated in comparison with the Checkerboard based full scan method [16].

The remainder of the paper is organized as follows. In Section II, various array testing methods are discussed. Section III briefly introduces Google's TPU model that was used for the implementation of this work. In Section IV, we present our analysis of conventional test methods. Section V presents the details of the proposed test architecture and its operation. Section VI gives the details of the proposed solution for at-speed testing with the partial scan based test architecture. In Section VII, the results of the associated experiments are given. Finally, we present the conclusion in Section VIII.

II. RELATED WORK

Testing of iterative arrays has previously been studied with C-testability, which is primarily based on functional testing with a constant number of test patterns to test each PE [17], [18]. Friedman [18] presented a theory for modified C-testability based on the function of the processing cell, which detects a single faulty cell of an array. Sung [19] presented sufficient conditions to ensure testability of unilateral and bilateral arrays for detection of a single faulty unit. Elhuni et al. [20] have shown that the test pattern length can be made independent of the size of the array, but this method is limited to one-dimensional iterative arrays. Lombardi [21] has extended the C-testability approach to systolic arrays, provided that additional patterns are used for testing the sequential cells (FFs) of a processing unit. These patterns ensure every possible transition as a sufficient condition to test the sequential cells. Moore and Bawa [22] presented a testing method for a bit-level unilateral systolic array, where the length of the test vectors increases with the size of the array. It uses a row comparator for each column to generate the test pass/fail result, thereby compressing the test response for the array. The main limitation of C-testability based functional testing is that it is restricted to detecting a single faulty cell in the whole array. BIST solutions (based on the single cell fault model) for array multipliers with deterministic (constant) patterns are presented in [23] and [24], in which MUX logic is introduced as a DFT solution to switch between functional and test mode.

In addition, strategies for testing identical cores have been proposed. Giles et al. [25] have addressed the testing of multiple identical cores by providing a scalable parallel test access mechanism (TAM) architecture. In this architecture, the response paths from each core are pipelined through comparators in order to compare the response of each core with that of a core which has already been tested by the ATE. Han et al. [26] proposed a TAM architecture for multiple identical cores that uses majority voting to check the test response of each core, and the majority response is cross-checked with the ATE response. The key takeaway is that the majority of the cores will match the expected response, so the minority of cores with a faulty response can be distinguished by the majority analyzer. A method for concurrent error checking between neighboring elements in a systolic array is presented in [27]. This requires additional XOR logic for output comparisons between neighboring elements and may result in an increased test area overhead.

Ma et al. [28] have tested an AI based SoC by broadcasting test patterns through embedded deterministic test (EDT) to the identical cores to reduce test time. These cores are isolated by IEEE 1500 wrappers and are tested by means of a comparator in subsequent test modes. However, this testing approach results in very high routing congestion due to input channel broadcasting, and, due to the hardware overhead associated with EDT, it is not an optimum solution for the edge-based AI accelerator. Moreover, this state-of-the-art solution uses the full scan DFT approach, which may not be a suitable solution for a systolic array. The reason is that the circuit connectivity may not allow each FF to provide the same level of controllability and observability, which is the case for most pipeline flow-based accelerators with unidirectional connections. A framework for functional criticality based stuck-at fault analysis for inference applications is presented in [29]. This machine learning based analysis of the gate-level netlist, performed prior to manufacturing test to target location-specific structural faults, may optimize test generation by specifying the test points/test pattern generation for these critical locations. However, this machine learning based analysis may add to the time-to-market constraint and affect the overall test cost.

Recently, a C-testing approach based on full scan DFT was proposed in [16]. The homogeneity of PEs is exploited for testing sub-arrays in multiple iterations, which are executed in