
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Wearout Resilience in NoCs Through an Aging Aware Adaptive Routing Algorithm

Dean Michael Ancajas, Kshitij Bhardwaj, Koushik Chakraborty, and Sanghamitra Roy

Abstract— Continuous technology scaling has made aging mechanisms, such as negative bias temperature instability and electromigration, primary concerns in network-on-chip (NoC) designs. In this paper, we extensively analyze the effects of these aging mechanisms on NoC routers and links. We observe a critical need for a robust aging-aware routing algorithm that not only reduces the power-performance overheads caused by aging degradation, but also minimizes the stress experienced by heavily utilized routers and links. To solve this problem, we propose an aging-aware adaptive routing algorithm and a router microarchitecture that route packets along paths that are both least congested and least degraded by aging. After an extensive experimental analysis using real workloads, we observe average overhead reductions of 13% in network latency and 12.17% in energy–delay product per flit, a 10.4% improvement in performance, and a 60% improvement in mean time to failure using our aging-aware routing algorithm.

Index Terms— Multicore processing, multiprocessor interconnection networks, parallel architecture.

I. INTRODUCTION

With the proliferation of on-chip cores enabled by rapid technology scaling, networks-on-chip (NoCs) are becoming a critical determinant of overall system power-performance characteristics. Consequently, the growing reliability challenges, which are continuously reshaping system design considerations, must now be thoroughly analyzed in the context of NoC designs [1]. In this paper, we study two primary mechanisms responsible for circuit wearout in NoC designs: 1) negative bias temperature instability (NBTI) and 2) electromigration.

Unlike previous works, we comprehensively analyze the impact of these mechanisms at both the circuit and the system level. This allows us to design a robust NoC while also considering power-performance issues in the system. For instance, reducing the usage of a heavily degraded path can cause unusual congestion in alternate routes if routing paths are not assigned optimally. Hence, efficient ways to improve system robustness require design space exploration techniques that simultaneously optimize multiple objectives. Such optimization problems must model many aspects of NoC design: routing topology, network traffic, device-level degradation, latency, and energy consumption. Our work exploits the interplay of these elements to design reliable NoCs.

We make the following contributions in this paper.
1) We introduce our slack acquisition circuit (SAC), a crucial part in the design of aging-aware NoCs. The SAC measures delay degradation, which is combined with congestion information to create a robust routing algorithm with minimal power-performance overhead (Section IV).
2) We analyze in more detail our previously proposed aging-aware adaptive routing algorithm [2] and its required router microarchitecture. The adaptive routing algorithm minimizes the degradation of NoC routers while mitigating the power-performance impact of aging. We include new simulation results on mean time to failure (MTTF) and on the uniformity of router aging.
3) An extensive experimental analysis using a top-down simulation infrastructure and real workloads (PARSEC benchmarks [3]) indicates average reductions of 13% and 12.17% in the network latency and energy–delay product per flit (EDPPF) [4], respectively, in a typical NoC undergoing aging degradation. We also obtain an average performance improvement of 10.4% using our proposed algorithm (Section VI).

II. RELATED WORK

Emerging systems with hundreds of billions of transistors are likely to have many faults even at the point of tapeout [5]. These faults can affect both processing cores and certain NoC components, drastically reducing their functionality. Many recent NoC works target these faulty components, outlining a plethora of techniques to tolerate the faults and sustain successful communication between two nodes at the expense of performance. In contrast to previous studies, our work maintains the high performance of a system by extending the lifetime of fault-free components in the NoC through wearout-resilient routing algorithms.

Several recent works aim to develop fault-tolerant and process variation-aware solutions for NoCs through routing. To mitigate faulty components, some works explore adaptive routing techniques [6], [7], probabilistic flooding [8]–[10], mean latency analysis in stochastic communication [11], and dynamically reconfiguring routing paths [12], [13]. Shi et al. [14] recently proposed a scalable and distributed fault-tolerant routing algorithm for NoCs that divides the system into regions, each of which guarantees fault tolerance within its own area. Other recent works use lightweight checker networks or invariance checking to detect and recover from faults [15], [16].

Process variation-aware routing algorithms are also attractive choices for dealing with the uncertainty in CMOS fabrication. Recently, Sharifi and Kandemir [17] proposed a routing scheme that selects the best path for each communication based on the process-variation-dictated speeds of routers and the current traffic pattern. Aisopos et al. investigated system-level modeling of variation-induced faults and proposed a holistic tool to assess fault manifestation in NoCs.

Manuscript received May 31, 2013; revised October 15, 2013; accepted January 8, 2014. This work was supported in part by the National Science Foundation under Grant CNS-1117425, Grant CAREER-1253024, and Grant CCF-1318826, and in part by the Micron Foundation.

The authors are with the Department of Electrical and Computer Engineering, Utah State University, Logan, UT 84322 USA (e-mail: dbancajas@gmail.com; kshitij.bhardwaj@aggiemail.usu.edu; koushik.chakraborty@usu.edu; sanghamitra.roy@usu.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2014.2305335
1063-8210 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

TABLE I
DIFFERENT DEGRADATION SCHEMES

Fig. 2. Time taken for the network to become faulty under various aging
models (high injection rate).

Fig. 1. Performance effect of a degraded NoC.

III. MOTIVATION

In this section, we motivate the need for using appropriate aging mechanisms in the analysis of NoC degradation. We first present a robustness analysis using the network latency of an NoC with respect to different aging mechanisms. Then, we discuss the performance implications of a degraded NoC.

A. Effect on Fault Tolerance of NoC

Links have been shown to be a crucial part of NoC degradation modeling [2]. We assume that a network becomes faulty when the increase in network latency exceeds a predefined threshold (20% in our study). Fig. 2 shows the time taken for the network to become faulty under the different degradation models shown in Table I. For example, under scheme D, a network can be rendered faulty in almost three years. However, using scheme A grossly overestimates the time to failure (almost six years). In reality, due to the combined effects of NBTI, electromigration, and other wearout mechanisms, such as hot-carrier injection and time-dependent dielectric breakdown, the total network delay is likely to increase beyond the threshold by that time.

B. Performance Implications of Degraded NoCs

A degraded network can severely affect the overall performance of running applications. Fig. 1 shows the performance hit taken by applications when flits take 20% more time to route through the network. This experiment is done on the same 4 × 4 mesh configuration with the PARSEC benchmark suite. The performance disparity will worsen as the size of the network grows.

Fig. 3. SAC. The paths are sampled through a series of delay buffers. The slack information is then combined with congestion information to calculate TTpE.

Fig. 4. NoC router with SAC. The SACs are placed in each pipeline stage.

IV. SLACK ACQUISITION CIRCUIT

In this section, we discuss our SAC, a crucial part in the design of aging-aware NoCs. The SAC measures the delay degradation of an NoC router. This delay degradation, combined with congestion information, is then used by our routing algorithm to decide the optimal path of a packet.

To guide the aging-aware routing algorithm, the SAC measures the increase in delay of each router. The SAC is placed in every stage of the router pipeline (Fig. 4) to capture the minimum slack among all stages. A more detailed diagram of the SAC is shown in Fig. 3. The control unit shown in Fig. 3 changes the multiplexer select signal in each cycle to choose which path to measure. Then, a series of m cascaded delay buffers (db1, db2, …, dbm) sample the signal at equal time intervals. The state transition captured at the output of each delay buffer provides an estimate of the delay of the path. Finally, a comparator and the TTpE module [2] obtain the minimum slack after n cycles.
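To make the sampling step concrete, the following is a minimal behavioral sketch in Python, not the authors' RTL. It assumes that delay buffer k shows the path output as it was k·δ before the capture edge, so the buffers that already show the state transition form a thermometer code; its length times δ gives a conservative slack estimate, and the comparator keeps the minimum over the n sampled cycles. The function names, the 25 ps buffer delay, and the path delays are all hypothetical.

```python
# Behavioral sketch of SAC slack sampling (an assumed model, not the authors' RTL).
#
# Assumptions: the selected path's output settles c_p ns after the launching
# clock edge; buffer k in the cascade adds k * delta ns, so at the capture edge
# (one clock period later) its output shows the path state at period - k * delta.
# Taps that already show the transition form a thermometer code (1s then 0s).


def capture_taps(c_p: float, period: float, m: int, delta: float) -> list[int]:
    """Captured bits of the m delay buffers: 1 if tap k (1..m) saw the transition."""
    return [1 if period - k * delta >= c_p else 0 for k in range(1, m + 1)]


def slack_from_taps(bits: list[int], delta: float) -> float:
    """Conservative slack estimate: number of transitioned taps times delta.

    The code is monotone, so counting the 1s gives the thermometer length.
    """
    return sum(bits) * delta


def sac_min_slack(sampled_delays: list[float], period: float, m: int, delta: float) -> float:
    """Comparator behavior: keep the minimum slack over the n sampled cycles."""
    return min(slack_from_taps(capture_taps(c, period, m, delta), delta)
               for c in sampled_delays)


if __name__ == "__main__":
    # Hypothetical 2 GHz router: the 0.5 ns period is covered by m = 20 taps of 25 ps.
    period, m, delta = 0.5, 20, 0.025
    # Delays of the paths chosen by the multiplexer in n = 3 consecutive samples.
    delays = [0.31, 0.41, 0.37]
    print(f"min slack ~ {sac_min_slack(delays, period, m, delta):.3f} ns")  # 0.075 ns
```

In this model the estimate is rounded down to a multiple of δ (the 0.41 ns path has a true slack of 0.09 ns but reports 0.075 ns), so a finer buffer delay buys resolution at the cost of more taps.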

Algorithm 1: Aging-Aware Adaptive Routing

The calculation of the router slack is formally defined as follows:

$$S_{\text{router}} = \min\left(s_1, s_2, \ldots, s_N\right) \tag{1}$$

$$s_n = \min\left(s_{p_1}, s_{p_2}, \ldots, s_{p_M}\right) \tag{2}$$

$$s_{\text{path}} = \frac{1}{f} - c_p \tag{3}$$

where $c_p$ and $f$ are the path delay at the time of sampling and the target frequency of the router, respectively. $S_{\text{router}}$ is the overall slack of the router, $s_n$ is the overall slack of each of the $N$ pipeline stages, and $s_{\text{path}}$ is the slack of each path; $s_{p_1}, \ldots, s_{p_M}$ are the slacks, given by (3), of the $M$ paths sampled in a stage. The SAC samples the path delay of the incoming link and the router buffers. Incoming links are included in the first stage of the router. To determine the overall minimum slack, the SAC uses a multiplexer to sample different signal paths.
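Equations (1)-(3) amount to a nested minimum over stages and paths. A minimal Python sketch follows, assuming delays in nanoseconds and the frequency in gigahertz, so that 1/f is the clock period in nanoseconds; the stage layout and delay values are hypothetical.

```python
# Sketch of the router-slack computation in (1)-(3).
# Assumptions (not from the paper): delays c_p are in ns and f is in GHz,
# so 1/f is the clock period in ns; each stage is a list of sampled path delays.


def path_slack(c_p: float, f_ghz: float) -> float:
    """Eq. (3): slack of one path = clock period (1/f) minus its sampled delay."""
    return 1.0 / f_ghz - c_p


def stage_slack(path_delays: list[float], f_ghz: float) -> float:
    """Eq. (2): a stage's slack is set by its slowest sampled path."""
    return min(path_slack(c, f_ghz) for c in path_delays)


def router_slack(stages: list[list[float]], f_ghz: float) -> float:
    """Eq. (1): the router slack is the minimum over all N stages."""
    return min(stage_slack(paths, f_ghz) for paths in stages)


if __name__ == "__main__":
    # Hypothetical 2 GHz router (0.5 ns period) with three pipeline stages;
    # stage 0 includes the incoming link, as described in the text.
    stages = [
        [0.30, 0.35],  # hypothetical stage 0: incoming link and buffer paths
        [0.42, 0.38],  # hypothetical stage 1
        [0.40],        # hypothetical stage 2
    ]
    print(f"{router_slack(stages, 2.0):.2f} ns")  # 0.08 ns (set by the 0.42 ns path)
```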
TABLE II
HARDWARE OVERHEADS

1) Process Variation Issues: In smaller technology nodes, process variation can cause the delay buffers to have differing delay resolutions after manufacturing. This can be mitigated by postsilicon tuning that exploits a programmable buffer series [18]. After the ICs are manufactured, the buffer series can be divided into multiple stages so that each stage has the same delay resolution.
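One way to picture the tuning step is a greedy grouping of consecutive buffers. The sketch below is our own hypothetical strategy, not the procedure of [18]: it assumes each buffer's delay has been measured after manufacturing and merges consecutive buffers until a group reaches a target resolution, so the resulting stages have roughly equal delay. The names, the 20 ps target, and the measured delays are illustrative.

```python
# Hypothetical post-silicon grouping of a programmable buffer series.
# Assumption (ours, not the method of [18]): after measuring each buffer's
# delay on silicon, consecutive buffers are merged greedily until a group
# reaches the target resolution, giving stages of roughly equal delay.


def group_buffers(measured_ps: list[float], target_ps: float) -> list[list[int]]:
    """Return buffer indices per stage; trailing buffers that cannot reach
    the target are left unused rather than forming a short stage."""
    stages: list[list[int]] = []
    current: list[int] = []
    acc = 0.0
    for i, d in enumerate(measured_ps):
        current.append(i)
        acc += d
        if acc >= target_ps:  # this group now spans at least one resolution step
            stages.append(current)
            current, acc = [], 0.0
    return stages


def stage_delays(measured_ps: list[float], stages: list[list[int]]) -> list[float]:
    """Total delay of each stage, to check how uniform the resolution is."""
    return [sum(measured_ps[i] for i in s) for s in stages]


if __name__ == "__main__":
    # Hypothetical measured delays (ps) of eight buffers after manufacturing.
    measured = [9.0, 12.0, 11.0, 10.0, 8.0, 13.0, 10.5, 9.5]
    stages = group_buffers(measured, target_ps=20.0)
    print(stages)                          # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print(stage_delays(measured, stages))  # [21.0, 21.0, 21.0, 20.0]
```

Greedy grouping can overshoot the target by up to one buffer delay, so the stages are only approximately uniform; a finer-grained buffer series narrows that gap.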
To calculate the overhead of our SAC, we implement its Verilog RTL model and integrate it with a well-known NoC router model [19]. Our implementation shows 1.2% area and 0.047% power overheads using a 45-nm TSMC technology targeted at 2 GHz. For timing, the SAC decreases the overall slack by 8.6% because the timing delay buffers are distributed to cover the whole clock period. However, this slack decrease is only experienced at the sampling rate (1 out of 10^10 times at 2 GHz), and hence does not affect the original performance of the router.

V. AGING-AWARE ADAPTIVE ROUTING

In this section, we present a robust aging-aware routing algorithm. Our algorithm reduces the stress on routers and links while adding minimal power-performance overhead.

A. Adaptive Aging-Aware Algorithm

Algorithm 1 shows an overview of our aging- and congestion-aware adaptive routing algorithm. The algorithm can be divided into two distinct steps: 1) path selection and 2) recovery cycle insertion. Path selection examines the shortest paths that are both least degraded and least congested, giving preference to paths with less degradation. Recovery cycle insertion inserts idle cycles into routers that are overloaded in each epoch, which balances the stress across the whole NoC. More details on our routing algorithm are given in [2].

B. Router Microarchitecture

To implement the proposed routing algorithm, we extend a congestion-aware router so that it computes the best route with respect to both aging and congestion. Moreover, this aging-aware router must also ensure that the best allocated output link is operating under its respective TTpE. We implement this by dividing the routing unit into two stages. The first stage selects the output link based on the policies discussed in Section V-A. In the second stage, the utilization of the selected output link is evaluated. If the link has reached its provisioned traffic in the current epoch, the router inserts recovery cycles; otherwise, the flits are forwarded immediately.

VI. EXPERIMENTAL RESULTS

To study the power-performance impact of aging on NoC designs, we conduct a set of experiments on a 4 × 4 NoC mesh. We evaluate our designs on the following metrics: system performance, EDPPF, network latency, MTTF, and aging variation across the network. The schemes used for comparison are described in Section VI-C.

A. Evaluation

We evaluate our schemes on several power-performance and reliability metrics. We employ a methodology that approximates the long-term aging impact in NoCs on a month-by-month basis to keep simulation time manageable. Our cross-layer analysis uses SPICE for process variation, NBTI, and electromigration analysis. Statistical timing analysis of synthesized RTL models of NoC routers then determines the impact of aging on slack. Application performance is evaluated using full-system simulation. More details about our cross-layer methodology can be found in [2].
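The two steps of Algorithm 1 (Section V-A) and the two-stage routing unit (Section V-B) can be sketched behaviorally as follows. This is a minimal illustration under stated assumptions, not the authors' microarchitecture: we assume a 2D mesh with minimal routing, a per-port slack reported by the downstream SAC (larger means less degraded), a congestion count, and a per-epoch flit budget `ttpe` standing in for the TTpE limit of [2]. Ranking ports by slack first and congestion second is our reading of "preference is given to paths with less degradation". The `honor_ttpe` flag is our own device; setting it to False loosely corresponds to AGE-ADAP as described in Section VI-C, which does not honor TTpE limits.

```python
# Behavioral sketch of Algorithm 1's two steps in a two-stage routing unit.
# Assumptions (ours, for illustration only): 2D mesh with minimal routing;
# each output port carries the downstream slack reported by the SAC (larger
# means less degraded), a congestion count, and a per-epoch flit budget
# `ttpe` standing in for the TTpE limit of [2]. "N" means +y, "E" means +x.
from dataclasses import dataclass


@dataclass
class OutPort:
    slack_ns: float            # downstream slack from the SAC (hypothetical field)
    congestion: int            # e.g., occupied downstream buffers (hypothetical)
    ttpe: int                  # per-epoch flit budget for this link (hypothetical)
    flits_this_epoch: int = 0  # utilization counter, reset every epoch


def minimal_ports(cur: tuple[int, int], dst: tuple[int, int]) -> list[str]:
    """Output directions that keep the packet on a shortest path."""
    (x, y), (dx, dy) = cur, dst
    ports = []
    if dx != x:
        ports.append("E" if dx > x else "W")
    if dy != y:
        ports.append("N" if dy > y else "S")
    return ports


def select_port(cur, dst, ports: dict[str, OutPort]) -> str:
    """Stage 1 (path selection): least degraded port first, then least congested."""
    return min(minimal_ports(cur, dst),
               key=lambda d: (-ports[d].slack_ns, ports[d].congestion))


def route_flit(cur, dst, ports: dict[str, OutPort], honor_ttpe: bool = True) -> str:
    """Stage 2 (recovery cycle insertion): stall if the chosen link used its budget."""
    if cur == dst:
        return "EJECT"
    d = select_port(cur, dst, ports)
    link = ports[d]
    if honor_ttpe and link.flits_this_epoch >= link.ttpe:
        return "RECOVERY"  # idle cycle: the flit is held and retried later
    link.flits_this_epoch += 1
    return d


def new_epoch(ports: dict[str, OutPort]) -> None:
    """Reset utilization counters at an epoch boundary."""
    for p in ports.values():
        p.flits_this_epoch = 0


if __name__ == "__main__":
    ports = {
        "E": OutPort(slack_ns=0.06, congestion=1, ttpe=2),  # more aged, less busy
        "N": OutPort(slack_ns=0.11, congestion=3, ttpe=2),  # less aged, busier
    }
    # From (0, 0) to (2, 2), both E and N lie on a shortest path.
    print([route_flit((0, 0), (2, 2), ports) for _ in range(3)])
    # ['N', 'N', 'RECOVERY']
    new_epoch(ports)
    print(route_flit((0, 0), (2, 2), ports))  # N (budget restored in the new epoch)
```

Note that, as in the text, the second stage does not re-route once the budget is exhausted; it only delays the flit, which is what spreads stress over time rather than shifting it onto a more aged link.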

Fig. 5. Latency, EDPPF, and performance (lower is better). (a) Network latency. (b) EDPPF. (c) Performance.

Fig. 6. Reliability (higher is better), MTTF (higher is better), and slack standard deviation (lower is better). (a) Reliability. (b) MTTF. (c) Slack standard
deviation.

B. Hardware Overhead of SAC and TTpE

Table II shows the hardware overhead with respect to a baseline NoC in [19]. We compare our added modules to ForEVeR, a prominent scheme that uses a lightweight checker network to enforce fault tolerance in an NoC system [15]. We compare only with the monitoring modules of ForEVeR, as that is the only portion that keeps track of the reliability state of the router, which is the same functionality as our SAC. All router modules are synthesized using the TSMC 45-nm technology.

Due to the slow progression of the aging process, a monitoring circuit does not need to sample at frequent time intervals (e.g., a sampling period of 10 s is satisfactory [20]). Thus, the SAC component is rarely activated, remaining in a clock-gated state at all other times. Consequently, it does not have any significant impact on the timing and dynamic power of the router microarchitecture (1.2% area and 0.05% power overhead). Slacks are propagated in the network through the flit link network, by triggering a dedicated multiplexer to latch the slack vector. Given the sampling rate, this slack propagation has a negligible effect on system performance.

As with the SAC, the TTpE module is triggered only once every epoch. As such, even if the process takes multiple cycles, it does not affect NoC performance. Assuming conservative latencies for the FP ALU, the whole process of calculating the TTpE would take less than 20 clock cycles. Moreover, this latency is hidden by using the TTpE value from the previous epoch until a new TTpE has been calculated. The TTpE module has also been synthesized in a 45-nm technology model, and we found its power and area overheads to be 0.02% and 4.01%, respectively, compared with the Stanford open-source NoC router [19]. The TTpE module has the largest area overhead due to the computational units needed to calculate the TTpE.

C. Comparative Schemes

We use three different schemes to show the importance of a robust aging-aware adaptive routing algorithm.
1) RCA-1D: This is our baseline, based on the congestion awareness scheme of [21].
2) AGE-ADAP: This scheme uses both congestion-aware and aging-aware adaptive routing. However, it does not honor any TTpE limits imposed by the aging awareness of the NoC.
3) AGE-ADAP-REC: This scheme extends AGE-ADAP by inserting recovery cycles for links/routers during an epoch if their utilization has reached the TTpE. Therefore, this scheme ensures that none of the routers/links operates beyond its calculated TTpE.

D. Robustness Evaluation

Fig. 6(a) shows the reliabilities of all three schemes for an aging period of seven years, calculated using the dependence of reliability on failure rate and TTpE [2]. As expected, RCA-1D shows a substantially higher failure rate than AGE-ADAP and AGE-ADAP-REC, as its design does not adapt to the wearout degradation of NoC components. In addition, the per-epoch traffic utilization of routers and links is always above the TTpE for RCA-1D, which further reduces its reliability. AGE-ADAP-REC has the best reliability because its utilizations are well below TTpE limits.

E. Power-Performance Analysis

We analyze power-performance overhead using three metrics: 1) network latency; 2) EDPPF; and 3) application performance.
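The results in Sections VI-E to VI-G are reported relative to RCA-1D. A minimal sketch of how such relative numbers can be computed follows, assuming EDPPF is taken as average network energy per flit multiplied by average flit latency (our reading of the metric name; see [4] for the original definition) and using the population standard deviation for router slacks (our choice). All input values are hypothetical placeholders, not measurements from this paper.

```python
# Sketch of the relative metrics in Sections VI-E to VI-G (assumed definitions).
# Assumption: EDPPF = (network energy / flits delivered) * (average flit latency),
# our reading of "energy-delay product per flit" [4]. All example numbers are
# hypothetical placeholders, not results from this paper.
from statistics import pstdev


def edppf(total_energy_nj: float, num_flits: int, avg_latency_cycles: float) -> float:
    """Energy-delay product per flit (nJ * cycles under the assumed units)."""
    return (total_energy_nj / num_flits) * avg_latency_cycles


def reduction_vs_baseline(scheme: float, baseline: float) -> float:
    """Percent reduction of a lower-is-better metric relative to the baseline."""
    return 100.0 * (baseline - scheme) / baseline


def normalized_slack_spread(scheme_slacks: list[float], baseline_slacks: list[float]) -> float:
    """Population std. deviation of router slacks, scaled to the baseline's."""
    return pstdev(scheme_slacks) / pstdev(baseline_slacks)


if __name__ == "__main__":
    base = edppf(total_energy_nj=500.0, num_flits=1000, avg_latency_cycles=40.0)
    ours = edppf(total_energy_nj=480.0, num_flits=1000, avg_latency_cycles=36.0)
    print(f"EDPPF reduction: {reduction_vs_baseline(ours, base):.1f}%")  # 13.6%
    spread = normalized_slack_spread([0.07, 0.09, 0.08, 0.10], [0.02, 0.10, 0.06, 0.14])
    print(f"normalized slack std. dev.: {spread:.2f}")  # 0.25
```

Because EDPPF multiplies energy and latency, a scheme can lose a little on energy (for example, from extra recovery cycles) and still reduce EDPPF if its latency drops enough.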

Fig. 5(a) shows the network latency of all schemes. RCA-1D has the worst latency because it does not implement any aging awareness. RCA-1D repeatedly selects only the least congested paths, which degrade heavily over time, leading to higher packet latency. In contrast, AGE-ADAP and AGE-ADAP-REC route packets along paths that are least degraded and least congested, and therefore incur lower overheads. On average, AGE-ADAP-REC reduces the latency by 13% relative to RCA-1D due to recovery cycle insertion.

1) EDPPF: We show the EDPPF overhead in Fig. 5(b). AGE-ADAP and AGE-ADAP-REC achieve lower overhead than RCA-1D owing to their better latency and power profiles. Due to the additional recovery cycles, AGE-ADAP-REC incurs a higher EDPPF than AGE-ADAP. On average, AGE-ADAP-REC reduces EDPPF by 12.17% relative to RCA-1D.

2) Performance: Fig. 5(c) shows the system performance of the schemes relative to RCA-1D. We observe that RCA-1D shows lower performance than the AGE-ADAP and AGE-ADAP-REC schemes. Since RCA-1D is only congestion aware, it selects the least congested paths over the least degraded ones, and therefore incurs higher performance overheads at the system level. Across different benchmarks, AGE-ADAP-REC shows a 10.4% performance improvement over RCA-1D, demonstrating its effectiveness.

F. System MTTF

Fig. 6(b) compares the lifetime obtained with each of our proposed routing schemes. The values are normalized with respect to RCA-1D. Our two schemes are comparable with each other: on average, AGE-ADAP and AGE-ADAP-REC extend the system lifetime by 53% and 60%, respectively, compared with the baseline. The largest improvement, 84%, is on the facesim benchmark, while the smallest, 10%, is on ferret. By limiting the amount of traffic on the routers based on the TTpE, both of our schemes reduce temperature and consequently aging, thereby extending the lifetime of routers that are about to fail. AGE-ADAP-REC has slightly better MTTF values due to its stricter adherence to the TTpE values.

G. Aging Variation

Another way to evaluate the effectiveness of reliability schemes is to examine the variation in slack across all routers in the network. Ideally, all components of a reliable system fail at the same time, rather than one component breaking early and taking down the whole system with it. Fig. 6(c) shows the standard deviation of all router slacks after N years. All values are scaled with respect to RCA-1D. Because RCA-1D routes flits based only on congestion, its router slacks vary widely. Our schemes reduce this variation by introducing reliability awareness: AGE-ADAP-REC improves the uniformity of router slacks by 49%, compared with only 42% for AGE-ADAP.

VII. CONCLUSION

In this paper, we comprehensively analyze the effects of multiple aging degradation mechanisms on NoC routers and links. Using the TTpE metric, we propose an aging-aware adaptive routing algorithm and router microarchitecture. Extensive experimental analysis incorporating the power-performance impact of aging demonstrates 13% and 12.17% improvements in network latency and EDPPF, respectively, for our algorithm. At the system level, our algorithm shows a 10.4% performance improvement for real workloads.

REFERENCES

[1] J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and L.-S. Peh, “Research challenges for on-chip interconnection networks,” IEEE Micro, vol. 27, no. 5, pp. 96–108, Oct. 2007.
[2] K. Bhardwaj, K. Chakraborty, and S. Roy, “Towards graceful aging degradation in NoCs through an adaptive routing algorithm,” in Proc. 49th IEEE DAC, Jun. 2012, pp. 382–391.
[3] (2010). PARSEC [Online]. Available: http://parsec.cs.princeton.edu/
[4] B. Li, L.-S. Peh, and P. Patra, “Impact of process and temperature variations on network-on-chip design exploration,” in Proc. 2nd IEEE Int. Symp. Netw. Chip, Apr. 2008, pp. 117–126.
[5] S. Borkar, “Thousand core chips: A technology perspective,” in Proc. 44th Annu. DAC, 2007, pp. 746–749.
[6] C.-L. Chou and R. Marculescu, “Farm: Fault-aware resource management in NoC-based multiprocessor platforms,” in Proc. DATE Conf., 2011, pp. 673–678.
[7] F. Chaix, D. Avresky, N.-E. Zergainoh, and M. Nicolaidis, “A fault-tolerant deadlock-free adaptive routing for on chip interconnects,” in Proc. DATE Conf., 2011, pp. 909–912.
[8] S. Pasricha, Y. Zou, D. Connors, and H. J. Siegel, “OE+IOE: A novel turn model based fault tolerant routing scheme for networks-on-chip,” in Proc. 8th Int. Conf. Hardw./Softw. Codes. Syst. Synth., Oct. 2010, pp. 85–94.
[9] S. Murali, D. Atienza, L. Benini, and G. D. Micheli, “A multi-path routing strategy with guaranteed in-order packet delivery and fault-tolerance for networks on chip,” in Proc. 43rd DAC, 2006, pp. 845–848.
[10] S. Manolache, P. Eles, and Z. Peng, “Fault and energy-aware communication mapping with guaranteed latency for applications implemented on NoC,” in Proc. 42nd DAC, Jun. 2005, pp. 266–269.
[11] P. Bogdan and R. Marculescu, “Hitting time analysis for fault-tolerant communication at nanoscale in future multiprocessor platforms,” IEEE Trans. Comput. Aided Des., vol. 30, no. 8, pp. 1197–1210, Aug. 2011.
[12] D. Fick, A. DeOrio, G. K. Chen, V. Bertacco, D. Sylvester, and D. Blaauw, “A highly resilient routing algorithm for fault-tolerant NoCs,” in Proc. DATE Conf., Apr. 2009, pp. 21–26.
[13] W.-C. Tsai, D.-Y. Zheng, S.-J. Chen, and Y. H. Hu, “A fault-tolerant NoC scheme using bidirectional channel,” in Proc. 48th DAC, 2011, pp. 918–923.
[14] Z. Shi, X. Zeng, and Z. Yu, “A scalable and reconfigurable fault-tolerant distributed routing algorithm for NoCs,” IEICE Trans., vol. 94-D, no. 7, pp. 1386–1397, Jun. 2011.
[15] R. Parikh and V. Bertacco, “Formally enhanced runtime verification to ensure NoC functional correctness,” in Proc. 44th Annu. IEEE/ACM Int. Symp. Microarchit., Dec. 2011, pp. 410–419.
[16] A. Prodromou, A. Panteli, C. Nicopoulos, and Y. Sazeides, “Nocalert: An on-line and real-time fault detection mechanism for network-on-chip architectures,” in Proc. 45th Annu. IEEE/ACM Int. Symp. Microarchit., Dec. 2012, pp. 60–71.
[17] A. Sharifi and M. T. Kandemir, “Process variation-aware routing in NoC based multicores,” in Proc. 48th DAC, 2011, pp. 924–929.
[18] K. A. Bowman, J. W. Tschanz, S.-L. Lu, P. A. Aseron, M. M. Khellah, A. Raychowdhury, et al., “A 45 nm resilient microprocessor core for dynamic variation tolerance,” J. Solid-State Circuits, vol. 46, no. 1, pp. 194–208, Jan. 2011.
[19] (2012). Open Source NoC Router RTL [Online]. Available: https://nocs.stanford.edu/cgi-bin/trac.cgi/wiki/Resources/Router
[20] E. Karl, P. Singh, D. Blaauw, and D. Sylvester, “Compact in-situ sensors for monitoring negative-bias-temperature-instability effect and oxide degradation,” in Proc. IEEE ISSCC, Feb. 2008, pp. 410–623.
[21] P. Gratz, B. Grot, and S. W. Keckler, “Regional congestion awareness for load balance in networks-on-chip,” in Proc. IEEE 14th Int. Symp. HPCA, Feb. 2008, pp. 203–214.
