Maurizio Palesi • Masoud Daneshtalab
Editors

Routing Algorithms in Networks-on-Chip

Editors
Maurizio Palesi
Facoltà di Ingegneria
Kore University of Enna
Cittadella Universitaria
Enna, Italy

Masoud Daneshtalab
Department of IT
University of Turku
Turku, Finland
Foreword

In the year 2000, when the idea of Networks-on-Chip (NoC) was proposed, many people looked at it sarcastically as "bizarre," "too complicated," and "unacceptably inefficient." Within a few years they were proved wrong. But even the greatest optimists did not predict that this new paradigm, for designing inter-core connections for multi-core systems using packet-switched communication, would win unanimous endorsement from academia as well as the semiconductor industry in a matter of just 10 years. Yet it happened, with the announcement of a NoC-based 48-core chip by Intel. There are now many annual international workshops and special track sessions at major international conferences dealing specifically with issues related to NoC architecture. Within the NoC architecture area, new ideas in routing algorithm design continue to dominate the research publications. During the earlier years, routing algorithm proposals sought to improve communication performance by maximizing routing adaptivity (while avoiding any possibility of deadlock) and by reducing congestion in general or application-specific contexts. More recently, researchers have expanded the scope of routing algorithm design by including fault tolerance and low power consumption as objectives alongside high performance. Motivated by advances in new technologies, proposals of routing algorithms for 3D architectures and for mixed electro-optical or purely optical NoCs have also started appearing in the literature.
As is obvious from research publications in various conferences and workshops, NoC is becoming an important topic of research and postgraduate teaching in universities all over the globe. Routing algorithm design is a challenging topic for researchers, since it allows graph-theoretic analysis of any proposed new solution. Thanks to the availability of free NoC simulators, this area also enables concrete and speedy experimental evaluation of new ideas in NoC routing, while the availability of ASIC design tools and FPGA prototyping tools allows the cost and power consumption implications of new ideas to be evaluated. Research in the area of routing algorithms is still flourishing and has by no means reached saturation. This book, entitled Routing Algorithms in Networks-on-Chip, is a collection of papers describing representative solutions to important aspects and issues related to routing algorithms. This collection does not claim
to include the best solutions for every aspect of routing algorithms, nor does it claim complete coverage of the topics related to this important area of NoC architecture. But the book does provide a good reference for postgraduate students and researchers getting started in this exciting area.
For many years, both Maurizio Palesi and Masoud Daneshtalab have been very active in research related to various aspects of NoC architecture design in general, and the design of routing algorithms in particular. They have made significant and distinctive contributions in the area of routing algorithms. Their contributions, through the organization of NoC-related workshops and special sessions at international conferences, as well as through special issues of various international journals, are well known and highly appreciated by the NoC community. The contacts and knowledge gained through these experiences have placed them in a unique position to put together this excellent collection of papers in book form.
The book is organized into six logical parts such that each part contains papers related to a common theme. For example, Part I contains papers proposing ideas to improve routing performance in NoC platforms. Similarly, Part II collects ideas related to multicast routing in NoC platforms. Further parts, each with multiple chapters, deal with fault-tolerant routing in NoCs, power/energy-aware routing, and routing for 3D and optical NoCs. The single chapter in the last part describes an industrial case study of a routing algorithm in a tera-scale architecture. This organization makes the book directly usable as a reference or as a textbook in a special-topics graduate course.
I recommend this book to all those who are new to the area of NoC architecture
and NoC routing and want to understand the basic concepts and learn about
important research issues and problems in this area. The book will also be useful
as a reference source to established research groups as well as industry involved in
NoC research. I feel this book will make an important contribution in promoting
education and research in NoC architecture.
Preface
are responsible for a significant fraction of the total power budget. Chapter 11
proposes a routing algorithm to reduce the hotspot temperature for application-
specific NoCs.
Emerging technologies are explored in the fifth part of the book. Chapter 12 introduces design concepts for traffic- and thermal-aware routing algorithms in 3D NoC architectures, which aim at minimizing the performance impact caused by run-time thermal management. A new architecture for nanophotonic NoCs, consisting of optical data and control planes, is proposed in Chap. 13.
Finally, in the last part of the book (Chap. 14), an industrial case study illustrates a comprehensive approach to architecting (and micro-architecting) a scalable and flexible on-die interconnect, and associated routing algorithms, that can be applied to a wide range of applications in an industry setting.
“Life is like riding a bicycle. To keep your balance you must keep moving.”
“Imagination is more important than knowledge. Knowledge is limited. Imagination
encircles the world.”
– Albert Einstein
Chapter 1
Basic Concepts on On-Chip Networks

1.1 Introduction
M. Daneshtalab ()
University of Turku, Turku, Finland
e-mail: masdan@utu.fi
M. Palesi
Kore University, Enna, Italy
e-mail: maurizio.palesi@unikore.it
Network topology refers to the arrangement and connectivity of the routers. In other words, it defines the channels and the connection pattern available for data transfer across the network. Performance, cost, and scalability are the important factors in the selection of an appropriate topology. Shared-bus, crossbar, butterfly fat-tree, ring, torus, and 2D-mesh are the most popular topologies for on-chip interconnects and have been used commercially [14, 21].
Direct networks have at least one processing element (PE) attached to each router of the network, so that routers can be spread regularly among the PEs; this helps to simplify the physical implementation. The shared-bus, ring, and 2D mesh/torus topologies (Fig. 1.2) are examples of direct networks.
Fig. 1.2 Popular on-chip topologies: shared-bus, ring, crossbar, mesh, and torus
On the other hand, indirect networks have a subset of routers that are not connected to any PE. All tree-based topologies in which PEs are connected only to the leaf routers (e.g., the butterfly topology), as well as the crossbar switch (Fig. 1.2), are indirect networks.
The shared-bus topology is the simplest one: a single link is shared by all PEs, which compete for exclusive access to the bus. This topology scales very poorly as the number of PEs increases, so communication-intensive applications must overcome the bandwidth limitations of the shared bus and move to scalable networks. A small modification of the shared-bus topology that allows more concurrent transactions is the ring topology, in which every PE has exactly two neighbors. In this topology, messages hop along intermediate PEs until they arrive at their final destination, which causes the ring to saturate at a low injection rate for most traffic patterns. The crossbar topology is fully connected, allowing every PE to communicate directly with any other PE; it provides a tremendous improvement in performance, but at the cost of hardware overhead that typically increases as the square of the number of PEs.
Fat-tree topologies suffer from the fact that the number of routers exceeds the number of PEs as the number of PEs grows, which incurs a significant network overhead. For on-chip interconnects the network overhead is more critical than for off-chip networks, and design scalability is more essential. Mesh and torus networks are widely used in multiprocessor architectures because of their simple connectivity and the easy routing provided by adjacency. Both torus and mesh topologies are fully scalable. Although the torus provides better performance, the mesh is often preferred for its regularity, better utilization of links, and lower network overhead; in particular, the mesh topology is the more economical scheme, since the routers on the borders are smaller. In sum, each topology has its own advantages and disadvantages.
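To make the scalability comparison concrete, the snippet below computes standard closed-form size metrics for n × n mesh and torus networks (these are textbook formulas, not figures taken from this chapter): the torus halves the worst-case hop count at the price of extra wrap-around links.

```python
# Standard size metrics for n x n mesh and torus topologies; a small
# illustration of the scalability trade-off discussed above.

def mesh_metrics(n):
    routers = n * n
    links = 2 * n * (n - 1)   # n(n-1) bidirectional links per dimension
    diameter = 2 * (n - 1)    # worst case: corner to opposite corner
    return routers, links, diameter

def torus_metrics(n):
    routers = n * n
    links = 2 * n * n         # wrap-around adds n links per row and column
    diameter = 2 * (n // 2)   # wrap-around halves the worst-case distance
    return routers, links, diameter

for n in (4, 8, 16):
    print(n, mesh_metrics(n), torus_metrics(n))
```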
the packets are assembled into the original message. If a message is divided into several packets, the packets must arrive at the destination PE in the same order in which they departed; in-order delivery is therefore an essential property that should be supported by on-chip networks. The packet switching mechanism improves channel utilization and network throughput.
In the packet switching domain, buffered flow control defines the mechanism that allocates channels and buffers to the packets traversing the network between source and destination. A flow control mechanism is necessary when two or more packets compete for the same channel at the same time. Three buffered flow control strategies are commonly used: store-and-forward, virtual cut-through, and wormhole. When implemented in on-chip networks, these mechanisms exhibit different performance characteristics and different requirements on hardware resources.
1.3.2 Virtual Cut-Through
The virtual cut-through mechanism [14] was proposed to address the large network latency of the store-and-forward strategy by reducing the packet delay at each routing stage. In this approach, a packet can be forwarded to the next stage before it is entirely received by the current router, which reduces the store-and-forward delays. However, when the next-stage router is not available, the virtual cut-through approach, like the store-and-forward mechanism, requires a large buffering space at each router to store the whole packet.
1.3.3 Wormhole
In this mechanism, a packet is divided into smaller segments called flits (FLow control digITs) [27]. The flits are then routed through the network one after another in a pipelined fashion. The first flit of a packet (the header) reserves the channel of each router, the body (payload) flits follow the reserved channel, and the tail flit finally releases the channel reservation. The wormhole mechanism does not require the complete packet to be stored in a router while waiting for the header flit to be routed to the next stage, so one packet may occupy several intermediate routers at the same time. In other words, the wormhole approach is similar to virtual cut-through, but channel and buffer allocation is done on a flit basis rather than a packet basis. Accordingly, the wormhole approach requires much less buffer space, enabling small, compact, and fast router designs. Because of these advantages, the wormhole mechanism is an ideal flow control candidate for on-chip networks.
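As a concrete (and deliberately simplified) illustration of these rules, the following sketch, with names of our own invention rather than anything from this chapter, models how a header flit reserves an output channel, body flits ride the reservation, and the tail flit releases it:

```python
# Minimal sketch of wormhole channel reservation. A 'worm' of flits holds a
# channel from header acquisition until the tail flit passes.

class Channel:
    def __init__(self):
        self.owner = None                # id of the packet holding the channel

    def try_reserve(self, packet_id):
        if self.owner is None:
            self.owner = packet_id
            return True
        return False                     # header blocks; flits wait in place

    def release(self):
        self.owner = None

def forward_flit(channel, flit):
    """flit = (packet_id, kind) with kind in {'head', 'body', 'tail'}."""
    packet_id, kind = flit
    if kind == 'head':
        return channel.try_reserve(packet_id)
    if channel.owner != packet_id:
        return False                     # channel belongs to another worm
    if kind == 'tail':
        channel.release()                # tail frees the reservation
    return True
```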
Fig. 1.3 A typical router using VCs
Fig. 1.4 A packet blocked in a network without VCs (a) and bypassing the blockage using VCs (b)
and throughput. Figure 1.4a shows how packet A, blocked between routers 3 and 4, also blocks packet B when the network is not equipped with VCs. As illustrated in Fig. 1.4b, using VCs allows dual utilization of the physical channel between routers 3 and 4, so that packet B can pass router 3. However, although employing VCs improves performance and reduces Head-of-Line (HoL) blocking effects in the network, it increases the design complexity of the link controller and of the flow control mechanisms.
NoC designs are commonly discussed in the form of two-dimensional (2D) and three-dimensional (3D) architectures. As shown in Fig. 1.5a, in a 2D NoC all switches are laid out in a single layer and connected to each other via intra-layer connections. In a 3D NoC (Fig. 1.5b), layers are stacked on top of each other and connected via inter-layer connections instead of being spread across a 2D plane [4, 17, 31]. Each layer can use a different technology, topology, clock frequency, etc. In recent years, through-silicon vias (TSVs) have attracted a lot of attention as the implementation of the inter-layer connections (vertical channels); TSVs enable faster and more power-efficient inter-layer communication across multiple stacked layers. Figure 1.5 illustrates a 2D and a 3D network with almost the same number of cores.
When multiple packets request the same output port, an output scheduling algorithm is needed to determine the priority order in which the candidate packets advance; the scheduler assigns a priority order to each packet.
Routing is the process used to forward packets along appropriate directions in the network between a source and a destination.
In general, a routing algorithm can be seen as the cascade of two main blocks which implement the routing function [5, 6, 18, 32] and the selection function [1, 19, 29, 34], as shown in Fig. 1.6. First, the routing function computes the set of admissible output channels towards which the packet can be forwarded to reach the destination. Then, the selection function is used to select one output channel from the set of admissible output channels returned by the routing function. In a router implementing a deterministic routing algorithm (Sect. 1.7.2), the selection block is not present, since the routing function returns only a single output port (see Fig. 1.6a). In a router implementing an oblivious routing algorithm (Sect. 1.7.2), the selection block takes its decision based solely on the information provided by the header flit (see Fig. 1.6b). Finally, network status information (e.g., link utilization and buffer occupation) is exploited by the selection function of a router implementing an adaptive routing algorithm (Sect. 1.7.2) (see Fig. 1.6c).
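As an illustration of this cascade, the sketch below (our own Python, assuming a 2D mesh with output ports E/W/N/S; the function names are invented) pairs a minimal adaptive routing function with a buffer-level selection function in the spirit of Fig. 1.6c:

```python
# Routing function: compute the set of admissible output ports toward dst.
def minimal_adaptive_routing(cur, dst):
    ports = set()
    if dst[0] > cur[0]: ports.add('E')
    if dst[0] < cur[0]: ports.add('W')
    if dst[1] > cur[1]: ports.add('N')
    if dst[1] < cur[1]: ports.add('S')
    return ports

# Selection function: pick the admissible port with the most free buffer
# slots (network-status information, as in an adaptive router).
def select_adaptive(ports, buffer_free):
    return max(ports, key=lambda p: buffer_free[p])

admissible = minimal_adaptive_routing((1, 1), (3, 2))   # -> {'E', 'N'}
out = select_adaptive(admissible, {'E': 2, 'N': 4})     # -> 'N'
```

In a deterministic router the first function would return a single port and the second would disappear; in an oblivious router the selection would ignore the buffer-status argument.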
Routing algorithms not only affect the transmission time but also can impact the
power consumption and congestion conditions in the network.
Routing can be performed either at the source router or in a distributed manner by the routers along the path. In the source routing scheme, the entire route of a packet is decided by the source router, which stacks the exact router-to-router itinerary of the packet in its header. As the packet traverses the network, this information is used by each router on the path to steer the packet towards its destination. This scheme is a simple solution for on-chip networks, but the routing information overhead is its drawback: for a network with a diameter of k, each packet requires up to k routing entries stacked in its header. Accordingly, as the network grows the header overhead becomes significant, which is impractical for on-chip networks. In contrast, in the distributed routing approach the routing decision is taken by the individual routers based on various parameters, and the header of a packet has to include only the destination address. Each intermediate router examines the destination address (sometimes the source address is also needed) and decides along which channel to forward the packet. However, the router complexity of the latter scheme is higher than that of the former.
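The header-overhead contrast can be sketched in a few lines (hypothetical helper names; a route is encoded as a list of output-port identifiers):

```python
# Source routing: the header carries the full itinerary, up to k entries for
# a network of diameter k.
def source_route_header(path_ports):
    return {'route': list(path_ports)}       # e.g. ['E', 'E', 'N']

def next_hop_source(header):
    return header['route'].pop(0)            # each router consumes one entry

# Distributed routing: the header carries only the destination; each router
# consults its own table (or logic) to pick the output channel.
def distributed_header(dst):
    return {'dst': dst}                      # constant-size header

def next_hop_distributed(router, header, routing_table):
    return routing_table[router][header['dst']]
```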
Figure 1.7 shows an example of a deadlock situation [28] occurring when, at the same time t, four packets (Packet 1, Packet 2, Packet 3, and Packet 4) are present at the west port of router 3, the south port of router 4, the east port of router 2, and the north port of router 1, respectively. The destination of each packet is two hops away counterclockwise: Packet 1 is destined to node 2, Packet 2 to node 1, Packet 3 to node 3, and Packet 4 to node 4. Assuming a routing function which returns the output port reaching the destination in a minimal hop count and favouring the counterclockwise direction, at time t + 1 the input logic of router 3 selects the west input, creating a shortcut between the west input port and the east output port. Under the wormhole rules, such a shortcut is maintained until all the flits of Packet 1 have traversed router 3. At the same time, a shortcut from the south input port of router 4 to its north output port is created. Similarly, shortcuts from the east input port to the west output port and from the north input port to the south output port are created at time t + 1 in router 2 and router 1, respectively. At time t + 2, the first flit of Packet 1 is stored into the west input buffer of router 4. The routing function determines that the flit has to be forwarded to the north output port, but this port is already assigned to forwarding the flits of Packet 2. Thus, the first flit of Packet 1 is blocked in the west input buffer of router 4. Similarly, the first flit of Packet 2 is blocked in the south input port of router 2, the first flit of Packet 3 is blocked in the east input port of router 1, and the first flit of Packet 4 is blocked in the north input port of router 3. Assuming a 1-flit input buffer size, the flits of the four packets cannot advance and, consequently, there is a deadlock.
A common way to verify the deadlock freedom property of a routing function is by means of Duato's theorem [12], which is an extension of the Dally and Seitz theorem [7] to adaptive routing functions. It is based on the analysis of the channel dependency graph (CDG) associated with the routing function and the network topology. Specifically, the following definitions are needed to introduce the theorem.
Definition 1. A Topology Graph TG = G(P, L) is a directed graph where each vertex p_i represents a node of the network and each directed arc l_{ij} = (p_i, p_j) represents a physical unidirectional link connecting node p_i to node p_j.
Let L_in(p) and L_out(p) be the sets of input links and output links of node p, respectively. Mathematically:

$$L_{in}(p) = \{\, l \in L : dst(l) = p \,\}, \qquad L_{out}(p) = \{\, l \in L : src(l) = p \,\}$$

where src(l) and dst(l) indicate the source and the destination network node of the link l.
Definition 2. A Routing Function for a node p ∈ P is a function R(p) : L_in(p) × P → ℘(L_out(p)). R(p)(l, q) gives the set of output links of node p that can be used to send a message received from the input link l and whose destination is q ∈ P. Here ℘ denotes the power set. We indicate with R the set of all routing functions: R = {R(p) : p ∈ P}.
Definition 3. Given a topology graph TG(P, L) and a routing function R, there is a direct dependency from l_i ∈ L to l_j ∈ L if l_j can be used immediately after l_i by messages destined to some node p ∈ P.
Definition 4. A Channel Dependency Graph CDG(L, D) for a topology graph TG and a routing function R is a directed graph. The vertices of the CDG are the links of TG. The arcs of the CDG are the pairs of links (l_i, l_j) such that there is a direct dependency from l_i to l_j.
Based on the above definitions, the following theorem gives a sufficient condition
for deadlock freedom.
Theorem 1 (Duato’s Theorem [12]). A routing function R for a topology graph
T G is deadlock-free if there are no cycles in its channel dependency graph CDG.
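The theorem suggests a mechanical check. The sketch below (ours, not from the chapter) builds the channel dependency graph of Definition 4 from a routing function in the form of Definition 2 and tests it for cycles with a depth-first search:

```python
# Build the CDG and check deadlock freedom per Duato's theorem.

def build_cdg(links, nodes, R):
    """links: set of directed links l = (src, dst); R[p] is a function
    (l_in, q) -> set of admissible output links of node p (Definition 2)."""
    deps = set()
    for li in links:
        p = li[1]                           # li is an input link of node p
        for q in nodes:
            for lj in R[p](li, q):
                deps.add((li, lj))          # direct dependency (Definition 3)
    return deps

def has_cycle(vertices, edges):
    """Recursive DFS cycle check; deadlock-free iff this returns False."""
    succ = {v: [] for v in vertices}
    for a, b in edges:
        succ[a].append(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = dict.fromkeys(vertices, WHITE)
    def dfs(v):
        color[v] = GRAY                     # v is on the current DFS path
        for w in succ[v]:
            if color[w] == GRAY or (color[w] == WHITE and dfs(w)):
                return True                 # back edge found: a cycle exists
        color[v] = BLACK
        return False
    return any(color[v] == WHITE and dfs(v) for v in vertices)
```

Here the CDG vertices are the links of TG, so deadlock freedom amounts to `not has_cycle(links, build_cdg(links, nodes, R))`.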
Livelock is a condition in which a packet keeps circulating within the network without ever reaching its destination. It can result from using a non-minimal adaptive routing algorithm. A livelock-free routing algorithm has to guarantee the forward progress of each packet: after each hop, the packet is one step closer to its destination.
communicate and which do not. By off-line profiling and analysis, one can also estimate quantitative information such as the communication bandwidth requirements between communicating pairs. After the applications have been mapped and scheduled on the NoC platform, information about communications that are never concurrent is also available. The APplication Specific Routing Algorithms (APSRA) methodology [30] allows highly efficient routing algorithms to be generated, tailored to a specific application. The basic idea behind APSRA is to compute the channel dependency graph (CDG) [13] by considering just the direct channel dependencies generated by the communicating cores. Such a CDG, called the application-specific CDG (ASCDG), contains fewer cycles than the CDG. For this reason, fewer cycles have to be removed (i.e., fewer prohibited turns), with a consequent reduction of the impact on the adaptivity of the routing function.
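A minimal sketch of this construction (our names; paths(s, d) is an assumed helper returning the admissible minimal link sequences of flow s → d):

```python
# Build the application-specific CDG (ASCDG): only dependencies actually
# exercised by communicating pairs are kept.
def build_ascdg(comm_pairs, paths):
    deps = set()
    for s, d in comm_pairs:               # only pairs that really communicate
        for route in paths(s, d):
            for li, lj in zip(route, route[1:]):
                deps.add((li, lj))        # dependency generated by this flow
    return deps                           # typically far fewer cycles than the CDG
```

The same cycle check used for the CDG then has fewer cycles to break, so fewer turns need to be prohibited.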
Let us consider the communication graph and the topology graph depicted in Fig. 1.10. For the sake of simplicity, assume that task T_i is mapped on node P_i, i = 1, 2, ..., 6.
The CDG [13] for a minimal fully adaptive routing algorithm is shown in Fig. 1.11. Since it contains several cycles, Duato's theorem [13] cannot assure deadlock freedom for minimal fully adaptive routing on this topology. To make the routing deadlock-free, it is necessary to break all cycles of the CDG. Breaking a cycle by removing a dependency restricts the routing function and consequently loses adaptivity. As many cycles have to be removed, the adaptivity of the resulting deadlock-free routing algorithm will be strongly reduced.
Many metrics have been used for estimating, evaluating, and comparing the performance of NoCs. These metrics include different versions of latency (e.g., spread, minimum, maximum, average, and expected), various versions of throughput, jitter in latency, jitter in throughput, etc. The most commonly used metrics are the average delay and the average throughput. The average delay is the mean of the average communication delay over all the communications. The average communication delay is the average delay experienced by the packets of a communication to reach their destinations. The packet delay is the time elapsed from the instant in which the header of the packet is injected into the network to the instant in which the tail of the packet (i.e., the tail flit) reaches the destination. The average throughput is the mean of the throughput over all the destination nodes, that is, the average number of packets received by a destination per time unit.
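These definitions translate directly into code; a hedged sketch, assuming each packet is logged as a (header-injection time, tail-arrival time) pair:

```python
# Average delay: mean, over packets, of tail-arrival minus header-injection.
def average_delay(packets):
    """packets: list of (inject_time, tail_arrival_time) pairs."""
    return sum(t_arr - t_inj for t_inj, t_arr in packets) / len(packets)

# Average throughput: mean, over destination nodes, of packets received
# per time unit.
def average_throughput(received_per_node, sim_cycles):
    rates = [n / sim_cycles for n in received_per_node]
    return sum(rates) / len(rates)
```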
Fig. 1.13 Delay variation (a) and throughput variation (b) under Transpose 2 traffic
Usually, the average delay and the average throughput are reported by means of diagrams for different packet injection rates. An example of such diagrams is shown in Fig. 1.13. The figure shows the delay variation and throughput variation under the Transpose 2 [9] traffic scenario for different routing functions (XY, Odd-Even [6], and APSRA [30]) and different selection policies (random, buffer-level [30]).
1.9 Summary
In this chapter, several important concepts in the domain of NoC design were presented. We discussed various topologies for direct and indirect networks. Different switching and flow control mechanisms, the use of virtual channels, routing schemes, output selection techniques, and a general network-on-chip architecture were also described. The concepts presented here are referred to in various places throughout the rest of this book.
References
6. G.-M. Chiu, The odd-even turn model for adaptive routing. IEEE Trans. Parallel Distrib. Syst.
11(7), 729–738 (2000)
7. W.J. Dally, C. Seitz, Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans. Comput. C-36(5), 547–553 (1987)
8. W.J. Dally, B. Towles, Route packets, not wires: on-chip interconnection networks, in
ACM/IEEE Design Automation Conference, Las Vegas, 2001, pp. 684–689
9. W.J. Dally, B. Towles, Principles and Practices of Interconnection Networks (Morgan Kauf-
mann, San Francisco, 2004)
10. M. Daneshtalab, M. Ebrahimi, T.C. Xu, P. Liljeberg, H. Tenhunen, A generic adaptive path-
based routing method for MPSoCs. Elsevier J. Syst. Archit. 57(1), 109–120 (2011)
11. M.M. de Azevedo, D. Blough, Fault-tolerant clock synchronization of large multicomputers
via multistep interactive convergence, in International Conference on Distributed Computing
Systems, Hong Kong, 1996, pp. 249–257
12. J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Trans.
Parallel Distrib. Syst. 4(12), 1320–1331 (1993)
13. J. Duato, A necessary and sufficient condition for deadlock-free routing in wormhole networks.
IEEE Trans. Parallel Distrib. Syst. 6(10), 1055–1067 (1995)
14. J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach
(Morgan Kaufmann, San Francisco, 2002)
15. M. Ebrahimi, M. Daneshtalab, F. Fahimeh, P. Liljeberg, J. Plosila, M. Palesi, H. Tenhunen,
HARAQ: congestion-aware learning model for highly adaptive routing algorithm in on-chip
networks, in ACM/IEEE International Symposium on Networks-on-Chip, Copenhagen, May
2012, pp. 19–26
16. M. Ebrahimi, M. Daneshtalab, P. Liljeberg, J. Plosila, J. Flich, H. Tenhunen, Path-based partitioning methods for 3D networks-on-chip with minimal adaptive routing. IEEE Trans. Comput. (2012), doi: 10.1109/TC.2012.255
17. M. Ebrahimi, M. Daneshtalab, P. Liljeberg, J. Plosila, H. Tenhunen, Cluster-based topologies
for 3D networks-on-chip using advanced inter-layer bus architecture. Elsevier J. Comput. Syst.
Sci. 79(4), 475–491 (2013)
18. C.J. Glass, L.M. Ni, The turn model for adaptive routing. J. Assoc. Comput. Mach. 41(5),
874–902 (1994)
19. J. Hu, R. Marculescu, DyAD – smart routing for networks-on-chip, in ACM/IEEE Design
Automation Conference, San Diego, 7–11 June 2004, pp. 260–263
20. ITRS 2011 edition, International Technology Roadmap for Semiconductors (2011). http://
www.itrs.net/
21. A. Jantsch, H. Tenhunen (eds.), Networks on Chip, chapter 1 (Kluwer Academic, Boston, 2003)
22. S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, A. Hemani, A network on chip architecture and design methodology, in IEEE Computer Society Annual Symposium on VLSI, Pittsburgh, 2002, p. 117
23. K. Li, R. Schaefer, A hypercube shared virtual memory, in International Conference on
Parallel Processing, University Park, 1989, pp. 125–132
24. X. Lin, L.M. Ni, Multicast communication in multicomputer networks. IEEE Trans. Parallel
Distrib. Syst. 4, 1105–1117 (1993)
25. P.K. McKinley, H. Xu, E.T. Kalns, L.M. Ni, CompaSS: efficient communication services for
scalable architectures, in International Conference on Supercomputing, Washington, D.C.,
1992, pp. 478–487
26. G.D. Micheli, L. Benini, Powering networks on chips: energy-efficient and reliable intercon-
nect design for SoCs, in International IEEE Symposium on Systems Synthesis, Montréal, 2001,
pp. 33–38
27. P. Mohapatra, Wormhole routing techniques for directly connected multicomputer systems. ACM Comput. Surv. 30(3), 374–410 (1998)
28. L.M. Ni, P.K. McKinley, A survey of wormhole routing techniques in direct networks. IEEE
Comput. 26, 62–76 (1993)
29. E. Nilsson, M. Millberg, J. Oberg, A. Jantsch, Load distribution with the proximity congestion
awareness in a network on chip, in Design, Automation and Test in Europe, Washington, D.C.,
2003, pp. 1126–1127
30. M. Palesi, R. Holsmark, S. Kumar, V. Catania, Application specific routing algorithms for
networks on chip. IEEE Trans. Parallel Distrib. Syst. 20(3), 316–330 (2009)
31. D. Park, S. Eachempati, R. Das, A. Mishra, Y. Xie, N. Vijaykrishnan, C.R. Das, MIRA: a multi-
layered on-chip interconnect router architecture, in International Symposium on Computer
Architecture, Beijing, 2008, pp. 251–261
32. J. Upadhyay, V. Varavithya, P. Mohapatra, A traffic-balanced adaptive wormhole routing
scheme for two-dimensional meshes. IEEE Trans. Comput. 46(2), 190–197 (1997)
33. H. Xu, P.K. McKinley, E.T. Kalns, L.M. Ni, Efficient implementation of barrier synchroniza-
tion in wormhole-routed hypercube multicomputers. J. Parallel Distrib. Comput. 16, 172–184
(1992)
34. T.T. Ye, L. Benini, G.D. Micheli, Packetization and routing analysis of on-chip multiprocessor
networks. J. Syst. Archit. 50(2–3), 81–104 (2004)
Part I
Performance Improvement

Chapter 2
A Heuristic Framework for Designing and Exploring Deterministic Routing Algorithms for NoCs
2.1 Introduction
Thanks to the high performance and low power budget of ASICs (application-specific integrated circuits), they have become common components in the design of embedded systems-on-chip. Advances in semiconductor technology facilitate the integration of reconfigurable logic with ASIC modules in embedded systems-on-chip. Reconfigurable architectures are used as new alternatives for implementing a wide range of computationally intensive applications, such as DSP, multimedia, and computer vision applications [1]. At the beginning of the current millennium, the network-on-chip (NoC) emerged as a standard solution for on-chip architectures [10, 11].
The turn model for designing partially adaptive routing algorithms for mesh and hypercube networks was proposed in [9]: prohibiting a minimum number of turns breaks all of the cycles and produces a deadlock-free routing algorithm. The turn model was used to develop the Odd-Even adaptive routing algorithm for meshes [4]. This model restricts the locations where some turns can be taken so that deadlock is avoided. In comparison with the turn model, the degree of routing adaptivity provided by Odd-Even routing is more even across different source-destination pairs.
The DyAD routing scheme, which combines deterministic and adaptive routing, is proposed for NoCs in [12]: the router works in deterministic mode when the network is not congested and switches to adaptive mode when the network becomes congested. In [23] the authors extend the routers of a network to measure their load and to send appropriate load information to their direct neighbors; this load information is used to decide in which direction a packet should be routed to avoid hot-spots. Recently, the authors of [19] presented APSRA, a methodology for developing adaptive routing algorithms for NoCs that are specialized for an application or a set of concurrent applications. APSRA exploits application-specific information about which pairs of cores communicate and which never communicate in the NoC platform to maximize communication adaptivity and performance. Since all of these approaches are based on adaptive routing, they suffer from out-of-order packet delivery. Our proposed routing framework overcomes this problem while minimizing the average packet latency across the network.
An application-aware oblivious routing scheme that statically determines deadlock-free routes is proposed in [14]. The authors presented a mixed integer-linear programming approach and a heuristic approach for producing routes that minimize the maximum channel load. However, in the case of realistic workloads, they did not study the effect of task mapping on their approach. We have also addressed the congestion-aware routing problem in [15]: with an analysis technique, we first estimated the congestion level in the network and then embedded this analysis technique into the loop of optimizing routing paths, so as to find deterministic routing paths for all traffic flows while minimizing the congestion level in the network. Since that framework cannot capture traffic burstiness, in this work we utilize an analytical model [14] to deal with bursty traffic.
The LAR framework consists of five steps; its flowchart is shown in Fig. 2.1. First, we represent the architecture and the application using a topology graph (TG) and a communication graph (CG), respectively. Then we construct the channel dependency graph (CDG) based on the TG and the CG. In the third step, an acyclic CDG is extracted by deleting some edges from the CDG to guarantee deadlock freedom.
Fig. 2.2 A 4 × 4 mesh network with nodes labeled 0–F
After that, we find all possible shortest paths for each flow to create the routing
space. Finally, we formulate an optimization problem over the routing space and
solve it. In the following subsections, each step is described in detail.
Each directed arc represents the communication volume from the source task to the destination task. As an example, the CG of a video object plane decoder (VOPD) is shown in Fig. 2.3 [24]. Each node in the CG corresponds to a task, and the numbers near the edges represent the bandwidth (in MBytes/s) of the data transfers for a 30 frames/s MPEG-4 movie at 1,920 × 1,088 resolution [24].
Dally and Seitz simplified the design of deadlock-free routing algorithms with a proof that an acyclic channel dependency graph (CDG) guarantees deadlock freedom [6]. Each vertex of the CDG is a channel of the TG. For instance, vertex 01 in Fig. 2.4 corresponds to the channel from node 0 to node 1 in Fig. 2.2. There is a directed edge from one vertex of the CDG to another if a packet is permitted to use the second channel in the TG immediately after the first one. To find the edges of the CDG, we use Dijkstra's algorithm to find all shortest paths between the source and destination of every flow in the corresponding TG. The CDG of a 4 × 4 mesh network (Fig. 2.2) under minimal fully adaptive routing is shown in Fig. 2.4a for the case where any two nodes may need to communicate, as in the uniform traffic pattern.
Fig. 2.4 The CDG of a 4 × 4 mesh network for minimal fully adaptive routing under (a) uniform and (b) transpose traffic patterns
In this step, we apply Dijkstra's algorithm to the acyclic CDG to find all shortest paths between the source and destination of each flow in the corresponding TG and create a set of f flows, RS = {F_1, F_2, ..., F_f}, where f is the number of flows in the system. Each F_i = (λ_i, CA_i, n_i, P_i), where λ_i is the average packet generation rate and CA_i is the coefficient of variation (CV) of the packet interarrival time of flow i. We recall that the CV of a random variable X is related to its moments by $C_X^2 = \overline{x^2}/\bar{x}^2 - 1$. In [14], we showed that the CV of a random variable reflects the burstiness intensity very well. n_i is the number of available shortest paths for flow i, and P_i is itself a set containing all n_i routes of flow i.
Usually more than one shortest path is available between two nodes (n_i > 1) in the routing space RS, so it is reasonable to choose paths such that the average packet latency is minimized. In the next subsection, we formulate an optimization problem over RS to find a suitable route for each flow and then use the simulated annealing heuristic to solve this problem.
For 2 ≤ i ≤ p, the average waiting time of packets that arrive on input channel i and leave through output channel j of router N is given by

$$W_{i \to j}^{N} = \frac{\lambda_j^N \left( C_A^2 + C_{S_j^N}^2 \right)}{2\left( \mu_j^N - \sum_{k=1}^{i-1} \lambda_{k \to j}^N \right)} \qquad (2.1)$$

where the variables are listed in Table 2.2 along with the other parameters used in this chapter. Therefore, to compute $W_{i \to j}^N$ we have to calculate the arrival rate from IC_i^N to OC_j^N ($\lambda_{i \to j}^N$), as well as the first and second moments of the service time of OC_j^N ($\bar{S}_j^N$ and $\overline{(s_j^N)^2}$). In the following two subsections, the packet arrival rate and the channel service time are computed.
channel service time are computed.
Assuming the network is not overloaded, the arrival rate from ICi N to OCj N can
be calculated using the following general equation
λi→ j = ∑S ∑D λ × P × R S → D, ICiN → OCNj
N S S→D
(2.2)
In Eq. 2.2, the routing function R(S → D, ICi N → OCj N ) equals 1 if a packet from
IPS to IPD passes from ICi N to OCj N ; it equals 0 otherwise. Note that we assume a
deterministic routing algorithm, thus the function of R(S → D, ICi N → OCj N ) can be
predetermined, regardless of topology and routing algorithm. After that, the average
packet rate to OCj N can be easily determined as
λ jN = ∑i λi→
N
j (2.3)
After estimating the packet arrival rates, we now focus on estimating the moments of the channel service times. First, we assign a positive integer index to each output channel. Let D_j^N be the set of all possible destinations of a packet which passes through OC_j^N. The index of OC_j^N is equal to the maximum of the distances between N and each M ∈ D_j^N. Obviously, the index of a channel is between 1 and the diameter of the network. In addition, the index of all ejection channels is defined to be 0. After that, all output channels are divided into groups based on their index numbers, so that group k contains all channels with index k.
The determination of the channel service time moments starts at group 0 (the ejection channels) and proceeds in ascending order of group numbers; the waiting time contributed by lower-numbered groups can then be thought of as adding to the service time of packets on higher-numbered groups. In other words, to determine the waiting time of the channels in group k, we must first calculate the waiting times of all channels in group k − 1. This approach is independent of the network topology and works for all kinds of deterministic routing algorithms, whether minimal or non-minimal.
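A sketch of the indexing step (our code; dist is an assumed hop-count function and the data layout is invented for illustration):

```python
# Index each output channel by the farthest destination reachable through it,
# then order the channels so that group k is solved after groups 0..k-1.

def channel_indices(channel_router, channel_dests, dist):
    """channel_router: {channel: owning router}; channel_dests: {channel: set
    of reachable destinations}; dist(a, b): hop count between nodes a and b."""
    index = {}
    for ch, router in channel_router.items():
        reach = channel_dests[ch]
        # ejection channels only reach the local node, so their index is 0
        index[ch] = max((dist(router, m) for m in reach), default=0)
    return index

def processing_order(index):
    return sorted(index, key=index.get)   # ascending group number
```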
In the ejection channel of R^N, the head flit and the body flits are accepted in t_s + t_w and L_b cycles, respectively. Therefore, we can write $\bar{S}_1^N = t_s + t_w + L_b$, and since the standard deviation of the packet size is known, we can easily compute $C_{S_1^N}$. Now, by using Eq. 2.1, the waiting time of the input channels for the ejection channel, $W_{i \to j}^N$, can be determined for all nodes in the network, where 2 ≤ i ≤ p.
Although the moments of the service time can be computed simply for all ejection channels, the service time moments of the other output channels cannot be computed as directly. The second moment of the channel service time can be expressed as

$$\overline{\left( s_i^M \right)^2} = \sum_{k=1}^{q} P_{j \to k}^{N} \left( t_s + t_w + t_r + W_{j \to k}^{N} + \bar{S}_k^N - \left( IB_j^N + OB_k^N \right) \times \max(t_s, t_w) \right)^2 \qquad (2.5)$$
where $P_{j \to k}^N$ is the probability that a packet entering from IC_j^N exits from OC_k^N, and equals

$$P_{j \to k}^{N} = \lambda_{j \to k}^{N} / \lambda_i^{M} \qquad (2.6)$$
Here, we note that to calculate $\bar{S}_i^M$ and $\overline{(s_i^M)^2}$, all values of $\bar{S}_k^N$ (1 ≤ k ≤ q) must have been computed beforehand. Finally, the CV of the channel service time of OC_i^M is given by

$$C_{S_i^M}^{2} = \overline{\left( s_i^M \right)^2} \Big/ \left( \bar{S}_i^M \right)^2 - 1 \qquad (2.7)$$
We are now able to compute the average waiting time of all output channels using Eq. 2.1. After computing $W_{i \to j}^N$ for all nodes and channels, the average packet latency between any two nodes in the network, L_{S→D}, can be calculated. The average packet latency is the weighted mean of these latencies.
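A sketch of this final aggregation, assuming (as one natural reading of "weighted mean") that each flow's latency is weighted by its packet generation rate:

```python
# Network-wide average packet latency as a rate-weighted mean of per-flow
# latencies L(S -> D).
def average_packet_latency(flows):
    """flows: list of (rate, latency) pairs, one per S -> D communication."""
    total_rate = sum(rate for rate, _ in flows)
    return sum(rate * lat for rate, lat in flows) / total_rate
```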
aims to increase the size of crystals and reduce their defects by heating a material and then slowly lowering the temperature to give the atoms time to attain the lowest energy state.
To simulate the physical annealing process, the simulated annealing algorithm starts with an initial solution, and at each iteration a trial solution is randomly generated. The algorithm accepts the trial solution if it lowers the objective function (a better solution), but also, with a certain probability, accepts a trial solution that raises the objective function (a worse solution). Usually the Metropolis algorithm [2] is used as the acceptance criterion, in which worse solutions are allowed whenever

$$R(0,1) < e^{-\Delta E / T}$$

where ΔE is the difference of the objective function between the trial and current solutions (negative for a better solution, positive for a worse one), T is a synthetic temperature, and R(0,1) is a random number in the interval [0,1]. Typically this step is repeated until the system reaches a state that is good enough for the application, or until a given computation budget has been exhausted. By accepting worse solutions, the algorithm avoids getting stuck in a local minimum in early iterations and is able to explore globally for better solutions. Detailed information about the simulated annealing approach can be found in [17].
As mentioned in Sect. 2.3.5.1, the objective function is the average packet latency, and the decision variables are represented by the routing set X = {x_1, x_2, ..., x_f}, where x_i is the path number of flow i (1 ≤ x_i ≤ n_i). Let X = {x_1, x_2, ..., x_r, ..., x_f} be the initial routing set. To choose a trial routing set X_new = {x_1, x_2, ..., x_r^new, ..., x_f}, we generate a random number r, where 1 ≤ r ≤ f, to choose a flow, and then generate another random number x_r^new, where 1 ≤ x_r^new ≤ n_r and x_r^new ≠ x_r, to choose another path for flow r. Using the analytical model described in Sect. 2.3.5.2, we estimate the average packet latency for the current and trial routing sets.
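Putting the pieces together, a minimal annealing loop for this search might look as follows (our sketch; estimate_latency stands in for the analytical model of Sect. 2.3.5.2, and the cooling schedule is illustrative, not the one used in the chapter):

```python
import math
import random

def anneal(num_paths, estimate_latency, t0=1.0, alpha=0.95, iters=2000):
    """num_paths[i] = n_i, the number of shortest paths of flow i."""
    x = [random.randrange(n) for n in num_paths]     # initial routing set X
    cost, temp = estimate_latency(x), t0
    for _ in range(iters):
        r = random.randrange(len(x))                 # pick flow r at random
        if num_paths[r] < 2:
            continue                                 # no alternative path
        trial = x[:]
        trial[r] = random.choice(
            [p for p in range(num_paths[r]) if p != x[r]])
        trial_cost = estimate_latency(trial)
        delta = trial_cost - cost
        # Metropolis criterion: accept improvements always, and worse
        # solutions with probability exp(-delta / temp)
        if delta < 0 or random.random() < math.exp(-delta / temp):
            x, cost = trial, trial_cost
        temp *= alpha                                # geometric cooling
    return x, cost
```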
to avoid distortions due to the startup transient. The standard deviation of the latency measurements is less than 1.8% of the mean value; as a result, the confidence level and confidence interval of the simulation results are 0.99 and 0.02, respectively.
For the sake of a comprehensive study, numerous validation experiments have been performed for several combinations of workload type and network size. In what follows, the capability of LAR is assessed for both synthetic and realistic traffic patterns. Since their applications differ starkly in purpose, these classes of NoC have substantially different traffic patterns.
The synthetic traffic patterns used in this research include uniform, transpose, shuffle, bit-complement, and bit-reversal [5]. Having developed models describing the spatial traffic distribution, we need an appropriate model for the temporal traffic distribution. In the case of synthetic traffic, we use the Poisson process to model the temporal variation of traffic; that is, the time between two successive packet generations in a core is distributed exponentially. The Poisson model is widely used in performance analysis studies, and a large number of papers in many application domains are based on this stochastic assumption.
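A sketch of such a generator (exponential gaps between injections, rate lam in packets per cycle):

```python
import random

# Poisson packet generation: successive inter-injection times are
# exponentially distributed with mean 1/lam.
def poisson_injection_times(lam, horizon):
    t, times = 0.0, []
    while True:
        t += random.expovariate(lam)
        if t > horizon:
            return times
        times.append(t)
```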
The average packet latencies in the 4 × 4 and 8 × 8 mesh networks are plotted against the offered load in Figs. 2.5 and 2.6, respectively. We observe that under the uniform and bit-complement traffic patterns LAR converges to DOR, because for these traffic patterns the average packet latency is minimal under DOR; the simulated annealing algorithm is not able to find better routes, and the final solution is the same as the initial solution. This result is consistent with other results reported in [4, 9, 12, 19]. The main reason is that DOR distributes packets evenly in the long term [9]. Previous works, Odd-Even [4], the turn model [9], DyAD [12], and APSRA [19], indicate that in the case of uniform traffic their proposed approaches underperform DOR. However, as can be seen in Figs. 2.5a and 2.6a, our proposed framework has the same performance as DOR for different traffic loads.
Figure 2.5b, c compares the latency of DOR and LAR in the 4 × 4 mesh network under the transpose and bit-reversal workloads, respectively. It can be clearly seen that LAR considerably outperforms DOR. Likewise, in the case of the 8 × 8 mesh network, LAR performs better than DOR, as shown in Fig. 2.6b, c.
Figures 2.5d and 2.6d reveal that under the shuffle traffic pattern LAR slightly outperforms DOR. Table 2.3 shows the maximum sustainable throughput of the network for each workload and each routing algorithm in the 4 × 4 and 8 × 8 mesh networks. It also shows the percentage improvement of LAR over DOR and reveals that on average LAR outperforms DOR; the maximum load that the network can handle using LAR is improved by up to 205%. The performance of the LAR framework is also compared against the DyAD routing scheme [12], which combines deterministic and adaptive routing algorithms.
Fig. 2.5 Average packet latency under (a) uniform and bit-complement, (b) transpose, (c) bit-reversal, and (d) shuffle traffic patterns in a 4 × 4 mesh network
Fig. 2.6 Average packet latency under (a) uniform and bit-complement, (b) transpose, (c) bit-reversal, and (d) shuffle traffic patterns in an 8 × 8 mesh network
In the case of realistic traffic, we consider two virtual channels per link to show the consistency of the proposed framework with multiple-virtual-channel routing. As realistic communication scenarios, we consider a generic multimedia system (MMS) and the video object plane decoder (VOPD) application. MMS includes an H.263 video encoder, an H.263 video decoder, an mp3 audio encoder, and an mp3 audio decoder [13]; the communication volume requirements of this application are summarized in Table 2.5. VOPD is an application used for MPEG-4 video decoding, and its communication graph is shown in Fig. 2.3. Several studies have reported bursty packet injection in on-chip interconnection networks carrying multimedia traffic [22, 25].
The Poisson process is not an appropriate model for bursty traffic; consequently, we use the Markov-modulated Poisson process (MMPP) as a stochastic traffic generator to model the bursty nature of the application traffic [5, 8]. MMPP has been widely employed to model traffic burstiness in the temporal domain [8]. Figure 2.7 shows a two-state MMPP in which the arrival traffic follows a Poisson process with rate λ0 or λ1; the transition rate from state 0 to state 1 is r0, while the rate from state 1 to state 0 is r1.
Since such systems contain various types of cores with different bandwidth requirements, the placement of tasks on the chip has a strong effect on system performance. To find a suitable mapping for these applications, we formulate another optimization problem to prune the large design space in a short time, again using the simulated annealing heuristic to find a suitable mapping vector. Initially, we map task i to node i and then try to minimize the average packet latency through the simulated annealing approach. Figure 2.8a shows that, in the case of the MMS application and DOR, the initial mapping M1 yields an average packet latency of 87; after a certain number of tries, the mapping vector converges to mapping M4, with an average packet latency of 41. The average packet latency values for mappings M2 and M3, two local minimum points of the simulated annealing process, are also shown in the figure.
Fig. 2.8 The effect of mapping and routing on the performance of (a) the MMS application and (b) the VOPD application
After the mapping phase, we apply the LAR framework to these four mapping vectors. Figure 2.8a reveals that in the case of mapping M1, LAR significantly reduces the average packet latency, from 87 to 67. However, for the more efficient mapping vectors (M2, M3, and M4) we achieve less improvement; in particular, for the best mapping (M4), the average packet latency is reduced only marginally, from 41 to 40. It is to be expected that DOR is latency-aware for the best mapping, because during the mapping optimization we fix the routing policy to DOR and strive to minimize the average packet latency under this routing policy. Likewise, as shown in Fig. 2.8b, the analysis results for the VOPD application are the same as for the MMS application.
Figure 2.8 reveals that, for application-specific traffic patterns, the improvement in the performance of the routing schemes depends strongly on how the application tasks are mapped onto the topology. This fact was not considered in related works such as [16]. Nowadays, embedded systems-on-chip contain several different types of cores, including DSPs, embedded DRAMs, ASICs, and generic processors, whose locations are fixed on the chip. On the other hand, such a system hosts several applications with completely different workloads. Furthermore, modern embedded devices allow users to install applications at run-time, so a complete analysis of such systems is not feasible during the design phase. As a result, it is not feasible to map all applications such that the load is balanced for all of them under a specific routing algorithm, and we should therefore balance the load in the routing phase.
In this section we used the LAR framework to find low-latency routes in the mesh network. Thanks to the simplicity, regularity, and low cost of the 2D mesh topology, it is the most popular one in the field of NoC. However, in large and 3D NoCs, which will be popular in the future, communication in the mesh architecture takes a long time. In the next subsection we use LAR to find deadlock-free paths in an arbitrary topology.
2.5 Conclusion
In this chapter, we presented a framework that estimates the average packet latency in the network and then embeds this analysis technique into the loop of optimizing routing paths, so as to quickly find deterministic routing paths for all traffic flows while minimizing the latency.
The proposed framework is appropriate for reconfigurable embedded systems-on-chip which run several applications with regular and repetitive computations on large data sets, e.g., multimedia and computer vision applications. LAR can not only design minimal deterministic routing, but can also implement non-minimal routing without virtual channels in arbitrary topologies.
References
19. M. Palesi et al., Application specific routing algorithms for networks on chip. IEEE Trans. Parallel Distrib. Syst. 20(3), 316–330 (2009)
20. K. Pawlikowski, Steady-state simulation of queueing processes: A survey of problems and
solutions. ACM Comput. Surv. 22(2), 123–170 (1990)
21. C. Sechen, A. Sangiovanni-Vincentelli, The TimberWolf placement and routing package. IEEE J. Solid-State Circuits SC-20(2), 510–522 (1985)
22. V. Soteriou, H. Wang, L.-S. Peh, A statistical traffic model for on-chip interconnection
networks, in Proceedings of the MASCOTS (Monterey, 2006), pp. 104–116
23. W. Trumler et al., Self-optimized routing in a network-on-a-chip. IFIP World Comp. Cong.
268, 199–212 (2008)
24. E.B. van der Tol, E.G. Jaspers, Mapping of MPEG-4 decoding on a flexible architecture
platform. SPIE 4674, 1–13 (2002)
25. G. Varatkar, R. Marculescu, Traffic analysis for on-chip networks design of multimedia
applications, in Proceedings of the Design Automation Conference (New Orleans, 2002),
pp. 795–800
Chapter 3
Run-Time Deadlock Detection
R. Al-Dujaily ()
University of Southampton, Southampton SO17 1BJ, UK
e-mail: r.al-dujaily@ecs.soton.ac.uk
T. Mak
The Chinese University of Hong Kong, Ho Sin-Hang Engineering Building, Shatin, Hong Kong
e-mail: stmak@cse.cuhk.edu.hk
F. Xia • A. Yakovlev
Newcastle University, Newcastle Upon Tyne, NE1 7RU, UK
e-mail: fei.xia@newcastle.ac.uk; alex.yakovlev@newcastle.ac.uk
M. Palesi
Kore University, Enna, Italy
e-mail: maurizio.palesi@unikore.it
3.1 Introduction
Deadlock can be avoided by restricting the routing function (e.g., with the turn model) or by enforcing a strict ordering of virtual channels (VCs) [15]. In general, avoidance techniques require restricted routing functions or additional resources, e.g., VCs. Due to its simplicity, the turn model technique is popular in NoCs, even though it limits the routing alternatives and diminishes fault tolerance capabilities [21]; moreover, it is not applicable to arbitrary network topologies.
Alternatively, deadlock recovery grants channels to packets without any routing restrictions, potentially outperforming deadlock avoidance [3, 25]. Deadlocks may then occur, and efficient detection and recovery mechanisms are required to intervene. However, detecting deadlock in a network is challenging because of the distributed nature of deadlocks. Heuristic approaches, such as timeout mechanisms, are often employed to monitor the activity of each channel and speculate about deadlocks. These techniques may produce a substantial number of false detections, especially when the network is close to saturation, where blocked packets can be flagged as deadlocked. Several techniques have been proposed for reducing the number of false detections in general computer networks [22, 24, 30]; nonetheless, they are all based on the timeout idea, and finding the best threshold values for different network settings is not an easy matter.
Unlike general computer networks, where inter-node information can only be exchanged through packets, on-chip networks can take advantage of additional dedicated wires to transmit control data between routers. This chapter exploits this NoC-specific capability and proposes a new deadlock detection method which guarantees true deadlock detection. A run-time transitive-closure (TC) computation scheme is employed to discover the existence of deadlock-equivalence sets, which imply loops of requests. Moreover, the proposed detection scheme can be realized in a distributed architecture, closely coupled with the NoC infrastructure, to speed up the necessary computation and to avoid introducing traffic overhead into the communication network. Initial results of this study and a sketch of the proposed architecture were presented in [1]. In [2], a complete theoretical framework of the TC computational approach is presented, the hardware architecture for its realization is detailed, and experimental results on real-life applications are included.
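The core idea can be prototyped offline in a few lines (our sketch; the chapter realizes the closure in distributed hardware rather than software): a true deadlock exists exactly when some channel transitively waits on itself.

```python
# Transitive closure (Floyd-Warshall style) over the channel wait-for
# relation; a self-reachable channel implies a loop of requests.
def deadlocked(channels, waits_for):
    """waits_for: set of (a, b) meaning channel a waits on channel b."""
    idx = {c: i for i, c in enumerate(channels)}
    n = len(channels)
    tc = [[False] * n for _ in range(n)]
    for a, b in waits_for:
        tc[idx[a]][idx[b]] = True
    for k in range(n):
        for i in range(n):
            if tc[i][k]:
                for j in range(n):
                    if tc[k][j]:
                        tc[i][j] = True
    return any(tc[i][i] for i in range(n))   # cycle <=> true deadlock
```

Unlike a timeout, this test never produces a false positive: it flags a channel only when an actual cyclic request dependency exists.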
The study of deadlock recovery in the context of NoCs is rare; as a consequence, the following information has mainly been acquired from the field of general computer networks. In [27, 33], the authors conclude that deadlocks can be highly improbable in interconnection networks when sufficient routing freedom is provided and fully exploited by the routing algorithm. It is thus not favorable to limit the adaptivity of the routing algorithm to avoid a rare event, e.g., by using the turn model [9, 10], nor to complicate the routers' design by devoting VCs specifically to preventing deadlocks [14, 15]. Since then, deadlock recovery has gained recognition for its potential to outperform deadlock avoidance, provided that efficient detection and recovery mechanisms exist [22, 24, 25, 30]. Deadlock detection and recovery are the two important stages of any deadlock recovery scheme [15].
In the detection stage, the network must discover at run-time any deadlock dependency cycle. However, detecting deadlocks at run-time is challenging because of their highly distributed characteristics. Thus deadlock detection is usually implemented in a distributed way using a timeout mechanism [3, 4, 18, 21, 22, 24, 25, 30]. In its simplest form, a packet occupying a channel is suspected to be in a deadlock if the channel has been inactive for a given threshold time [3], and the recovery stage is then started. The minimum hardware components required for each physical channel in the router to implement such a scheme are a counter, a comparator, a latch and, in case a programmable threshold is required, a register to store the threshold value. Accurate deadlock detection with this mechanism is very sensitive to the network load and to the length of the messages transmitted over the network channels. A long threshold time leads to more accurate deadlock detection, but takes longer to discover a deadlock, which increases packet blocking and dramatically degrades network performance [25]. Conversely, a short threshold value detects deadlock faster, but with a higher probability of false detections (false positives); this can saturate the deadlock recovery channels, in the case of a dedicated central channel used for recovery in each router [3], which again reduces network performance [25].
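As an illustration of this mechanism, the following minimal Python sketch models the per-channel counter/comparator logic described above; the class name and interface are ours, not taken from the cited works.

```python
class TimeoutDetector:
    """Per-channel crude timeout detection (in the spirit of [3])."""

    def __init__(self, threshold):
        self.threshold = threshold   # threshold register (programmable)
        self.counter = 0             # inactivity counter

    def cycle(self, occupied, flit_advanced):
        """Call once per clock cycle; returns True when deadlock is suspected."""
        if not occupied or flit_advanced:
            self.counter = 0         # channel free or making progress: reset
            return False
        self.counter += 1            # channel blocked: count another idle cycle
        return self.counter >= self.threshold   # comparator output
```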
Several techniques have been proposed for interconnection networks to reduce the number of false positives. In [24] a packet is suspected of being deadlocked if all the channels requested by a blocked packet are inactive for a given timeout. To further reduce the number of false deadlock detections, the mechanism presented in [25] is designed to identify only one packet in a sequence of blocked packets as being deadlocked. The work of [25] is less vulnerable to false detections at the expense of extra hardware (two comparators, two latches and two threshold values). In [30] the author proposed a technique that employs special control packets (called probes) to traverse inactive channels for more accurate detection. By and large, all these works are based on the timeout mechanism and face the difficulty of tuning the threshold value, i.e., of selecting a single threshold value that suits different traffic patterns and network loads. Therefore, a method which accurately detects deadlock without false alarms, and which does so quickly and independently of the network's traffic and load, is desirable.
In the recovery stage, there are two kinds of deadlock recovery schemes: regressive and progressive. A regressive recovery is based on the abort-and-retry mechanism [18], which eliminates the suspected packets from the network, while a progressive recovery resolves the deadlocked configuration without removing suspected packets from the network. For example, the DISHA progressive recovery scheme utilizes additional hardware in each router (a central buffer) to bypass the suspected packets to their destinations sequentially [3] or concurrently [4]. The bandwidth of these central channels is a fraction of the original network bandwidth; therefore, if the detection technique detects many false deadlocks, it will saturate the recovery bandwidth and, as a result, degrade the network's performance [25].
This chapter presents a new deadlock detection method which guarantees true deadlock detection for NoCs. The results of this study, based on a cycle-accurate simulator, demonstrate the effectiveness of the method: it drastically outperforms timing-based deadlock detection mechanisms by eliminating false detections in various traffic scenarios. In this chapter the emphasis is not on comparing with deadlock avoidance techniques; hence, a simple abort-and-retry approach is employed to recover from detected deadlocks. The primary target is thus to analyze the deadlock detection rate and accuracy of each studied detection technique.
The following sections present the proposed deadlock-equivalence set principle and
the general methodology for the proposed deadlock detection method.
This work assumes a fully adaptive routing algorithm with minimal paths. 'Minimal' in this context means that the routing algorithm always chooses a shortest path between sender and receiver. A wormhole flow control technique [12] is employed, a method which has been widely used in NoCs [29]. In wormhole flow control, packets are divided into smaller flits. A header flit is routed first and the rest follow it in a pipelined manner, which allows a packet to occupy several channels simultaneously. Wormhole thus reduces the number of buffers required in each router, a desirable feature for on-chip networks. However, wormhole makes networks more prone to blocking and deadlock [26, 33].
In this chapter, NoCs are studied without using any deadlock avoidance techniques, but adopting deadlock recovery with an accurate deadlock detection method (proposed in the next section). In line with existing work on deadlock detection/recovery [3, 22, 24, 25, 30], this work assumes that a channel buffer cannot contain flits belonging to different packets. Moreover, a packet arriving at its destination is eventually consumed; in other words, the network is deadlock free at the protocol interaction between the NoC and the IP cores. It is important to note that this work focuses on studying deadlocks caused by packets at the network routing level. Extending this work to detect deadlocks that may arise from message dependencies caused by the protocol interaction between the IP cores, through their network interfaces (NIs), and the network (routers and channels) is left as future work. Moreover, in this work the agent that owns and requests resources (channels) is termed a flit; it is equally valid to regard the agent as a message or a packet.
This section introduces the proposed method for deadlock detection in NoCs. It first gives some important definitions and then defines the deadlock-equivalence set (DES) criterion for detecting loops of packet requests, using the TC computation to determine whether a set of channels in the NoC forms a DES. The following definitions lead to that of a deadlock in NoCs:
Definition 1. A SoC and/or CMP consists of a communication infrastructure, called a NoC, and computation/storage cores, called IPs.
Definition 2. A NoC N(V, C) is a strongly connected directed graph, where V is a set of elements called vertices that represent the router nodes, and C ⊆ V × V is a set of ordered pairs, with (ci, cj) ≠ (cj, ci), called edges, that represent the channels connecting the routers. A single channel is only allowed to connect a given pair of routers in one direction.
Definition 3. I is a set of processing/storage elements representing the IPs integrated on-chip. Each IP has one injection channel and one delivery channel directly connected to a router (v ∈ V).
Definition 4. The routing function R in N is a function that returns the output channel, cout ∈ C, for each current node, vc ∈ V, and destination node, vd ∈ V, such that cout ≠ cin. In other words, deflection routing is not allowed and no channel has the same network node as both its source and destination.
The information routed and propagated in the NoC consists of packets/flits; these represent the agents that own and/or request the network resources (channels). The
resource ownership and request in a NoC at any particular time can be expressed as
a channel wait-for graph (CWG) [14].
Definition 5. A CWG is a directed graph G = (C, E), where the vertex set C represents the set of channels of the network N, and E is a set of ordered pairs, called edges, which represents the channel occupation and requisition status.
At any particular state of the NoC, there exists an edge (u, v) ∈ E either if (1) there
is a head flit in channel u requesting channel v, or if (2) there is a flit in channel u
and another flit in channel v and both of them belong to the same packet. Case (1)
refers to the request status and is drawn as dashed arcs in the CWG, while case (2)
refers to the ownership status and is drawn as solid arcs.
For instance, case (1) can be seen in Fig. 3.1, where a head flit occupies ch1 and requests ch2. Case (2) is also shown in the figure, where a data flit occupies ch2 and its head flit is in ch3. A matrix representation of the CWG can be extracted in the form of an adjacency Boolean matrix.
Definition 6. The adjacency Boolean matrix A of a directed graph G = (C, E) is an n × n matrix, where n is the cardinality of C and a_{i,j} ∈ A for all i, j ∈ {1, . . . , n}. A is constructed as follows: (1) a_{i,j} = 1 (true) iff (i, j) ∈ E (an edge exists from vertex i to vertex j); (2) a_{i,j} = 0 (false) otherwise, including when i = j (see Definition 4).
Given a directed graph G , it is possible to answer reachability questions using
the concept of transitive-closure. For example, can one get from node (vertex) u to
node v in one or more edges (hops)?
Definition 7. The transitive closure (TC) of G is a derived graph, G+ = (C, E+), which contains an edge (u, v) if, and only if, there is a path from u to v in one or more hops. In particular, if G contains the edges (u, w) and (w, v), then v can be reached from u (transitivity property).
The derived graph G + , the transitive-closure of G , is the result of adding to G
only the edges that cause G to satisfy the transitivity property and no other edges
(i.e., not adding edges that do not represent paths in the original graph). Thus, the
TC of a directed graph is obtained by adding the fewest possible edges to the graph
such that it is transitive. It can be computed from A using Floyd-Warshall algorithm
[11] (see Algorithm 1, lines 12–18) and will be denoted as T .
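For illustration, the Boolean transitive closure can be computed in a few lines of Python following the Floyd-Warshall structure referred to above; this is our own sketch, not a reproduction of Algorithm 1.

```python
def transitive_closure(A):
    """Boolean transitive closure T of adjacency matrix A (Floyd-Warshall).
    A[i][j] is True iff the CWG has an edge from channel i to channel j."""
    n = len(A)
    T = [row[:] for row in A]          # start from the adjacency relation
    for k in range(n):                 # allow k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                T[i][j] = T[i][j] or (T[i][k] and T[k][j])
    return T
```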
At this point, it is necessary to introduce the deadlock-equivalence set criterion
for detecting a loop of packet requests.
Definition 8. Let C = {c1, c2, . . . , cn}, with n = |C|, be the set of channels of the network N, and consider a subset M = {c1, c2, . . . , cm} of channels for some m ≤ n. M is a DES iff in all its channels there are flits waiting for one another in a cyclic manner to progress to their respective destinations, i.e., c1 is occupied by a flit which requests c2, c2 is occupied by a flit which requests c3, . . . , cm−1 is occupied by a flit which requests cm, and cm is occupied by a flit which requests c1.
Owing to the characteristics of NoCs, the CWG at any particular time has at most one outgoing arc per node. Hence, a node can appear in a DES only once and cannot appear in multiple DESs at the same time. Thus, the members of a set of simultaneous DESs, S = {Si}, are pairwise disjoint; that is, Si, Sj ∈ S and i ≠ j imply Si ∩ Sj = ∅.
The TC computation can be used to determine whether there is a set of channels in
the network forming a DES. To demonstrate the idea so far, consider the following
example:
Example 1. Given a channel i and a channel j, where i, j ∈ C, assume a flit occupies each of these channels and each flit requests the other channel, i.e., the flit occupying channel i requests channel j and the one occupying channel j requests channel i. Then the corresponding adjacency Boolean matrix is

$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},$$

and computing its transitive closure gives T1,2 = T2,1 = 1 and T1,1 = T2,2 = 1. The last two entries imply a loop of channel requests, as channel i (transitively) requests itself and channel j requests itself. These appear as self-reflexive paths in the TC graph (see Fig. 3.2).
This can be extended to a subset M = {c1 , c2 , . . . , cm } of channels for some
m ≤ n, such that all pairs of elements in M meet the self-requesting condition:
$$\mathrm{DES}(i,j) = \begin{cases} 1 & \text{if } T_{i,j} = T_{j,i} = T_{i,i} = T_{j,j} = 1,\\ 0 & \text{otherwise,} \end{cases} \qquad \forall\, i, j \in M. \tag{3.1}$$
To recover from a deadlock situation, the dependency cycle formed by the DES must be resolved. A deadlock can be resolved if one or more of the packets forming the deadlock are removed from the network [18, 22, 25]. The next section presents the DES computational complexity for NoCs.
As shown in the previous section (more details can be found in [2]), the deadlock-
equivalence set (DES) provides a simple criterion for deadlock detection. The
technique is to generate the CWG from the network and then derive the TC of
the CWG and identify the channels that satisfy the criterion. Figure 3.3 illustrates
this idea. Given a network at any particular state (Fig. 3.3a) the CWG can readily
be drawn (Fig. 3.3b). The derived TC graph (Fig. 3.3c) clearly shows four vertices
(channels) with self-reflexive paths and all pairs of these satisfy the condition of the
DES (Eq. 3.1).
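A software rendering of this procedure (our own sketch under the definitions above, reusing the transitive_closure function from the previous sketch; this is not the hardware realization) is then straightforward:

```python
def deadlocked_channels(A):
    """Channels that belong to some DES: exactly those whose TC entry
    T[i][i] is set, i.e., with a self-reflexive path (cf. Eq. 3.1)."""
    T = transitive_closure(A)
    return {i for i in range(len(A)) if T[i][i]}

def same_des(T, i, j):
    # Eq. 3.1: i and j lie in one DES iff all four entries equal 1.
    return T[i][j] and T[j][i] and T[i][i] and T[j][j]
```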
A hardware scheme for speeding up the TC computation is presented in the next section. Moreover, the reflexive property is exploited to further simplify DES detection in NoCs and so realize the proposed deadlock detection.
Dynamic programming (DP) can yield the solution for the TC [11] and offers the opportunity to carry out the computation on a parallel architecture. Mapping the TC computation onto a parallel computational platform can be achieved by introducing a TC-network. This network has a parallel architecture and can compute the TC solution through the simultaneous propagation of successive inferences. Lam and Tong [20] introduced DP-networks to solve a set of graph optimization problems in an asynchronous and continuous-time computational context. This inference network is inherently stable in all cases and has been shown
where ∧ (the AND function) is the inference performed by the site function and ∨ (the OR function) is the conflict-resolution operator of the unit function, which resolves the binary relation (i, j). The computational units are interconnected in the same way as the NoC structure: each unit represents a router node and a link signifies a communication channel. A distributed network can readily be implemented using such a realization.
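As a software model of this parallel evaluation, one plausible synchronous update sweep can be sketched as follows; this is an illustration of the update rule implied above, not the asynchronous, distributed hardware of [20].

```python
def tc_network_step(A, T):
    """One synchronous sweep: every unit (i, j) recomputes its output as
    T'[i][j] = A[i][j] OR (OR over k of T[i][k] AND A[k][j])."""
    n = len(A)
    return [[A[i][j] or any(T[i][k] and A[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def tc_network_converge(A):
    """Iterate sweeps to a fixed point; the number of sweeps k is bounded
    by the length of the longest dependency chain in the network."""
    T = [row[:] for row in A]
    while True:
        T_next = tc_network_step(A, T)
        if T_next == T:
            return T
        T = T_next
```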
The delay for the TC-network to converge to a deadlock detection depends on the size of the DES and the network topology, which together determine the delay of information propagation within the NoC and the delay of each computational unit. As can be seen, each unit involves O(|A|) AND operations and one OR operation, where |A| is the number of adjacent edges. Hence, the solution time is O(k|A|), where k is the number of iterations evaluated by each unit. In a software computation, k equals the number of nodes in the network, which guarantees that all nodes have been updated. In the hardware implementation with parallel execution, however, k is determined by the network structure and the |A| AND operations can be
executed in parallel. Each computational unit can simultaneously compute the new expected output for all neighbor nodes. The delay for the TC-network to converge to a deadlock detection is directly proportional to the size of the DES, which determines the delay of information propagation within the TC-network and the delay of the TC computational units. The worst-case delay for the TC-network to converge to a deadlock detection is investigated for several network topologies that are popular in the NoC literature. Some topologies are excluded because they are inherently deadlock free in conjunction with the routing function stated in Definition 4, e.g., tree, bus, star, butterfly, etc. Table 3.1 shows the largest DES for several network topologies as a function of the network size (k); induction was used to derive these expressions. Consider the k-ary 1-cube (ring) topology: the maximum number of channels in such a network is 2k, while the largest, and only possible, DES has size k, as this is the longest dependency cycle. Likewise, for the k-ary 2-mesh network topology (2D mesh) of k² nodes with k rows and k columns, the smallest DES spans four nodes while the largest DES is 3k² − 3k − 2; thus, the detection time depends on the network topology and on how deadlocks are distributed over the network nodes.
A TC-unit propagates new information to its neighbor units if, and only if, there is a chain of channel dependencies between its input and output. This makes the TC-network self-pruning and keeps the switching power to a minimum. The function performed by the TC-unit that seizes the token is described in Algorithm 2. The algorithm checks each channel in the router node. In the case where a self-reflexive path is detected for a particular channel (line 10), the channel is part of a DES and the channel's corresponding deadlock flag is set (line 11), which may be used by the router node to trigger a recovery. Otherwise, the deadlock flag is reset and the next channel is tested. Once the TC-unit finishes checking all the channels of the corresponding router, it passes the token to the next neighbor unit. The time needed for the TC-network to converge and provide useful information mainly depends on the DES size. However, the TC-network can produce a valid output even if it does not converge within a single clock cycle, since a deadlock is a steady and persistent event [14].
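Algorithm 2 itself is not reproduced here, but its behavior for the token-holding TC-unit can be paraphrased as follows (an illustrative sketch with our own function names):

```python
def tc_unit_check(router_channels, T, dd_flag, pass_token):
    """Software paraphrase of Algorithm 2 for the TC-unit holding the token."""
    for ch in router_channels:
        if T[ch][ch]:           # self-reflexive TC path: ch is part of a DES
            dd_flag[ch] = True  # true deadlock detected; may trigger recovery
        else:
            dd_flag[ch] = False
    pass_token()                # forward the token to the next neighbor unit
```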
Fig. 3.6 Schematic of the router. (Left) Top-level view. (Middle) Block implementing the TC-unit.
(Right) Block implementing the TC-computation
As shown in Fig. 3.6, each router is augmented with a TC-unit and a number of extra control wires in order to perform the proposed deadlock detection. These requirements are evaluated and compared, in Sect. 3.5.2.3, with a similar router architecture using the state-of-the-art timeout detection mechanism.
The architecture introduced in Fig. 3.6 is only one of the possible router implementations. Different optimizations could be applied, depending on the particular case under consideration. A point of interest is the generalization of the approach to check more than one router channel concurrently. This can be accomplished by circulating more than one token in the token-ring protocol chain, thus activating more than one router in the network to check its channels. It should be noted that a further TC-unit block and TC-interconnect would be required, leading to an additional area/power increase. However, for the investigated traffic patterns and network topology the results are not substantially different.
where E denotes energy, Edd_resolving is the energy consumed to resolve any detected deadlock, h is the number of hops the flit traversed before being aborted, Flitage is the number of clock cycles the flit existed in the NoC (either moving or waiting for resources to be freed), and f is a Boolean flag that adds the last term to the equation if the flit is a head flit (a blocked head flit continues consuming energy by trying to reserve an output channel in each clock cycle).
The performance evaluation was carried out using a modified version of Noxim [16]. In particular, Noxim was extended with the TC-network and with the timing-based deadlock detection methods [3, 24, 25]. Without any loss of generality, a NoC with the following specification was chosen for the evaluation: a mesh topology with a five-port router architecture, fully adaptive routing with a random selection function, no virtual channels, and a crossbar switch; these settings are representative of a wide range of NoC configurations. Each input channel has four flit buffers, and one clock cycle is assumed for routing and for transmission across the crossbar and a channel. The results are captured after a warm-up period of 10,000 clock cycles, and the overall simulation time is set to 300,000 clock cycles. To increase the confidence in the captured results, the simulation at each injection rate is repeated five times with different seeds and the mean values are taken.
Figure 3.7 shows the performance results for a 4 × 4 2D mesh NoC and a uniform distribution of packet destinations¹ at different injection rates. The packet lengths are randomly generated between 2 and 16 flits. Examining Fig. 3.7a, the majority of the deadlocks detected using the timeout method [3] are false alarms. For instance, with the threshold value set to 32, 22% of the packets injected into the network at the higher injection rates are flagged as part of deadlocks. The TC-network instead detected that less than 1% of packets are in true deadlocks, consistent with the literature [33], which states that deadlock is an infrequent event.
Figure 3.7b shows the network average delay versus the throughput over the full load range. A smaller timeout threshold value improves these two network metrics because it detects more false deadlocks and, by dropping the corresponding packets, alleviates network contention that might otherwise turn into congestion. However, resolving the detected deadlocks by merely dropping the detected packets without retransmission increases the energy consumed by the NoC while adding nothing of significance in terms of network throughput and latency.
Examining Fig. 3.7a, b, the threshold value of 256 could be selected as the best value for such a network setting, as it produces a minimum detection percentage of 2.7% with good throughput and latency. The selection of the best threshold value for different network settings (packet length, buffer size and traffic type) was the goal of several studies [22, 24, 25, 30]. The TC-network method, however, detects
¹ This kind of traffic pattern is the most commonly used traffic in network evaluation [14], even though it is very gentle because it naturally balances the load over the whole network.
Fig. 3.7 Performance evaluation of the proposed deadlock detection method (TC-network) and
the timeout mechanism under uniform traffic scenario. (a) Percentage of detected deadlocked
packets to the total received packets. (b) NoC performance with different injection rates.
(c) Percentage of total energy consumed to resolve detected deadlocks to the total NoC energy
true deadlocks and produces performance figures similar to those of the timeout mechanism with a threshold of 256, without the need for any parameter tuning.
Figure 3.7c shows the percentage of the energy consumed to resolve detected deadlocks relative to the total network energy. The figure is similar to, but not linearly proportional to, the detection percentage figure (Fig. 3.7a). This is because significant energy consumption is caused by the routing function [16], which, in the case of the timeout method, repeatedly tries to route the head flit until the timeout value has passed (see Eq. 3.4). For instance, Timeout-128 detects 5.3% at saturation and wastes 8.8% of the total consumed energy by aborting these packets, while Timeout-512 detects 2.6% at saturation and wastes 9.1% of the energy. There are two underlying reasons for this: first, the NoC with a bigger threshold value delivers fewer flits in the given simulation time (Fig. 3.7b); second, the Flitage in Eq. 3.4 is directly proportional to the threshold value when a packet is detected as deadlocked.
To investigate different traffic scenarios and NoC sizes, Fig. 3.8 shows the performance results with bit-reversed traffic.² The network size is an 8 × 8 mesh
² Each node with binary address {bn−1, bn−2, . . . , b0} sends a packet to the node with address {b0, b1, . . . , bn−1}.
Fig. 3.8 Performance evaluation of the proposed deadlock detection method (TC-network) and
the timeout mechanism under bit-reversed traffic scenario. (a) Percentage of detected deadlocked
packets to the total received packets. (b) NoC performance with different injection rates.
(c) Percentage of total energy consumed to resolve detected deadlocks to the total NoC energy
and the packet sizes are randomly chosen between 32 and 64 flits. The results, in general, show a trend similar to the previous example. Here, the threshold value of 1,024 could be selected as the best threshold. The TC-network method detects around 0.07% of packets as deadlocked, and dropping them consumes less than 0.2% of the energy, compared with 4% of packets detected using Timeout-1024 and an energy waste of around 10%.
Moreover, this study implemented the deadlock detection techniques proposed in [24, 25], which were designed to enhance the accuracy of deadlock detection over the crude timeout method [3], i.e., to reduce the false positive deadlock alarms. The threshold value for each simulated timeout technique was set to 16, 32, 64, 128, 256, 512, and 1,024 clock cycles, but only the results for 64 and 256 cycles are presented, since the other values mainly scale the measures up or down. In the following figures, the deadlock detection methods proposed in [3, 24] and [25] are labeled crude, sw and fc3d, respectively, followed by the timeout threshold value used.
For the rest of the synthetic traffic scenarios, the results are summarized in
Table 3.2. The results are presented for a single injection rate which is labeled as
Table 3.2 The percentage of TC-network improvement compared to timing-based methods for different threshold values and traffic scenarios

Traffic          IR     crude-64     crude-256    sw-64        sw-256       fc3d-64      fc3d-256     TC-network
                        DD     Er    DD     Er    DD     Er    DD     Er    DD     Er    DD     Er    DD      Er
Shuffle          0.14   50     40    33     30    33     25    20     17    25     17    16     12    0.10    0.3
Transpose        0.10   35     28    13     11    11     7.3   1.6    1.3   3.1    1.9   0.7    0.4   0       0
Butterfly        0.24   29     34    12     17    17     17    3.7    4.6   9.6    10    2.1    2.5   0       0
Random1          0.05   18     16    2.8    2.7   7.4    5.6   1.8    1.3   6.3    4.5   2.1    1.5   0.07    0.1
Random2          0.05   20     16    4.9    4.6   8.6    5.5   3.9    3.0   7.1    4.9   3.2    2.2   0.69    1.2
TC improvement          176×   83×   75×    41×   89×    70×   36×    31×   59×    45×   28×    22×   –       –

'crude' refers to the work in [3], 'sw' to the work in [24] and 'fc3d' to the work in [25]. DD: percentage of detected deadlocks; Er: percentage of total energy consumed to resolve detected deadlocks; IR: injection rate; Random1: uniformly distributed traffic with four hot spots located at the corners; Random2: uniformly distributed traffic with four hot spots located at the center
the saturation packet injection rate (PIR),³ since the majority of detections occur after the network saturation point. Table 3.2 illustrates the efficacy of the proposed run-time deadlock detection method over the three existing timing-based methods [3, 24, 25]. The TC-network method significantly outperforms the timing-based deadlock detection mechanisms by avoiding false detections and, as a result, reducing the energy wasted in resolving detected deadlocks for all synthetic traffic scenarios presented in the table. It should be noted that the quantitative analysis presented in the last row of Table 3.2 as 'TC improvement' varies with the injection rate; it is used here to summarize, on average, by what factor the TC-network reduces the number of detected deadlocked packets compared to the timing-based schemes [3, 24, 25]. The TC-improvement magnitude would therefore not necessarily be the same if the evaluation covered the entire load range. The results presented are for an 8 × 8 mesh with packet sizes randomly chosen between 32 and 64 flits.
³ A network starts saturating when an increase in injection rate does not result in a linear increase in throughput [6].
Fig. 3.9 Performance evaluation of the proposed deadlock detection method (TC-network) and
the timeout mechanism under MMS traffic with different injection rates. (a) Percentage of detected
deadlocked packets to the total received packets. (b) NoC performance with different injection
rates. (c) Percentage of total energy consumed to resolve detected deadlocks to the total NoC
energy
Table 3.3 reports the energy consumed to drain 2 MB of MMS data using different deadlock detection methods. The last column in the table shows the accumulated energy saving obtained with the proposed detection method over the three IRs presented in the same table. The energy saving differs between the detection methods and threshold values used; for instance, the TC-network can save up to 5% of the energy needed to run the 2 MB MMS traffic compared to the timing-based method [3] with a threshold value of 64.
It is crucial when designing NoCs that the routers do not consume a large percentage of silicon area compared to the IP core blocks. For this study, two fully adaptive routers, based on the timeout and the TC-network methods, were designed in Verilog. These were then synthesized using Synopsys Design Compiler and mapped onto the UMC 90 nm technology library. Confirming well-known findings (for instance [6]), the hardware synthesis results show that the FIFO buffer area significantly dominates the logic of the router. The buffer area mainly depends on the flit size. For a flit size of 64 bits and FIFO buffers with a capacity of four flits, it was found that the TC circuit adds only 0.71% area overhead to the total router area, compared
Table 3.3 Energy consumption (mJ) to drain 2 MB of multimedia system data for different network loads and different deadlock detection methods

Method used         Injection rate = 0.05         Injection rate = 0.15         Injection rate = 0.2          Accumulated
                    Etrans.  Erecover  Etotal     Etrans.  Erecover  Etotal     Etrans.  Erecover  Etotal     energy saving (%)
TC-network          0.7152   0.0       0.7152     0.8040   0.0       0.8040     0.8128   0.0       0.8128     –
Timeout-crude-64    0.7002   0.0366    0.7368     0.6736   0.1696    0.8432     0.6693   0.2068    0.8761     5.05
Timeout-crude-256   0.7204   0.0035    0.7239     0.7548   0.0566    0.8114     0.7530   0.0787    0.8317     1.48
Timeout-sw-64       0.7181   0.0046    0.7226     0.7497   0.0621    0.8117     0.7653   0.0671    0.8324     1.47
Timeout-sw-256      0.7247   0.0004    0.7230     0.7911   0.0223    0.8134     0.7987   0.0301    0.8288     1.40
Timeout-fc3d-64     0.7136   0.0036    0.7171     0.7864   0.0362    0.8226     0.7873   0.0381    0.8244     1.36
Timeout-fc3d-256    0.7182   0.0002    0.7184     0.8090   0.0127    0.8218     0.7978   0.0155    0.8133     0.91
Table 3.4 Area and power contributions of the TC-unit and different timeout circuits to the total router area and power

Module name           Module area to total    Module power to total
                      router area (%)         router power (%)
TC-unit               0.71                    0.44
Timeout-crude-1024    2.93                    0.97
Timeout-crude-256     2.33                    0.81
Timeout-crude-64      1.56                    0.70
Timeout-sw-1024       3.12                    1.16
Timeout-sw-256        2.54                    0.98
Timeout-sw-64         1.76                    0.91
Timeout-fc3d-1024     3.53                    1.51
Timeout-fc3d-256      2.94                    1.38
Timeout-fc3d-64       2.12                    1.24
to 2.9% for the timeout [3] implementation with a 10-bit threshold counter. The proposed method thus yields an area gain of more than 2% in each network router circuit compared to the timeout implementation.
Power consumption is also an important system metric. The power dissipated in each router design was determined by running Synopsys Design Power on the gate-level netlist of the router with different random input data streams as test stimuli. It was found that the power dissipated by the TC-unit is 0.44% of the entire router power, compared to 0.97% dissipated by the crude timeout circuit with a 10-bit threshold counter; this provides a power saving of 0.53%. In the Intel TeraFLOPS 80-tile chip implementation [31], the communication power (router + links) is estimated at 28% of the tile power profile, with the router power accounting for 83% of the communication power. Taking these numbers into consideration, together with the power result from the proposed router synthesis, suggests that the TC-unit would dissipate less than 0.01% of the total tile power in similar NoC implementations.
Moreover, an investigation was carried out into the area and power contributions of the circuits of the different timing-based deadlock detection methods, with different time-threshold values, compared with the TC-unit circuit implementation. The area gain and power saving of the TC-unit implementation over the different timeout implementations can be observed in Table 3.4. The table clearly shows that the techniques used in [24] and [25] introduce more area and power overhead than the crude timeout implementation, owing to the extra hardware required to implement them. The TC-network implementation, however, not only improves the performance of the deadlock detection method, but also minimizes the area and power overheads, as can be seen in Table 3.4.
However, the implementation of the TC-network needs some extra control wires, as shown in Fig. 3.6. The tc_rx[n] input wires and tc_tx[m] output wires of each TC-unit come from, and go to, the TC-units in the neighboring router nodes.
Fig. 3.10 TC-network convergence time for different network topologies and sizes: worst-case and average-case times (in ms) for k-ary 2-mesh and k-ary 2-cube networks versus the number of nodes as a function of k
The total number of extra control wires per router is therefore

$$TC\text{-}Network_{wires} = n + m + 4, \tag{3.5}$$
where n is the number of input channels and m the number of output channels in each router node. Considering the data used in the experiments (a 2D mesh network whose input and output channels are {North, East, South, West}, i.e., n = m = 4, and a flit size of 64 bits), the TC-network requires 12 wires in this case, which is less than 2% of the total wiring cost of the NoC. This number is small, becomes relatively smaller with a bigger flit size, and is independent of the capacity of the input and output buffers.
In order to study the operation delays, first the TC-unit's critical-path gate delay was calculated using the SDF file generated after synthesizing the circuit with the worst-case library. Second, the interconnect delay calculation assumes the tiles are arranged in a regular fashion on the floorplan with a 2 × 1.5 mm tile size, similar to the Intel TeraFLOPS chip [31]; the maximum interconnect length between routers is then 2 mm. The load wire capacitance and resistance are estimated, using the Predictive Technology Model (PTM) [28], to be 0.146 fF and 1.099 Ω/μm, respectively. The wire delay between TC-units, over the TC-interconnect, can then be readily calculated based on the distributed RC model [8]. With reference to Table 3.1, the worst and average convergence times of the TC-network for different NoC topologies can be estimated. Figure 3.10 shows the worst and average times to discover a deadlock for different network topologies and sizes. As expected, the delay required by the TC-network to converge to the desired output depends on the network topology and the size of the DES.
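As a rough back-of-the-envelope illustration of the interconnect-delay step (our own sketch using the classic 0.38RC approximation for a distributed RC line, not the full model of [8], and assuming the quoted capacitance is per micrometre, like the resistance):

```python
# Rough per-hop TC-interconnect delay, distributed RC approximation.
R_PER_UM = 1.099       # Ohm/um (PTM estimate quoted above)
C_PER_UM = 0.146e-15   # F/um (assuming the quoted 0.146 fF is per um)
LENGTH_UM = 2000.0     # maximum inter-router wire length: 2 mm

r_total = R_PER_UM * LENGTH_UM          # ~2.2 kOhm
c_total = C_PER_UM * LENGTH_UM          # ~0.29 pF
delay_ns = 0.38 * r_total * c_total * 1e9
print(f"per-hop TC-interconnect delay ~ {delay_ns:.2f} ns")  # ~0.24 ns
```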
NoCs with adaptive routing are susceptible to deadlock, which can lead to performance degradation or system failure. This chapter studied deadlock detection and recovery, as opposed to deadlock avoidance. Detecting deadlocks at run-time is challenging because of their highly distributed characteristics. This work presented a deadlock detection method that utilizes run-time transitive closure (TC) computation to discover the existence of deadlock-equivalence sets, which imply loops of requests in NoCs. This detection scheme guarantees the discovery of all true deadlocks without false alarms, in contrast to state-of-the-art approximation and heuristic approaches. A distributed TC-network architecture, which couples with the NoC architecture, was also presented to realize the detection mechanism efficiently.
The proposed method was rigorously evaluated using a cycle-accurate simulator and synthesis tools. Experimental results confirm the merits and effectiveness of the proposed method: it drastically outperforms timing-based deadlock detection mechanisms by eliminating false detections and thus reduces the energy wasted in recovering from false alarms for various traffic scenarios, including a real-world application. The new method eliminates the need for any kind of timeout mechanism and delivers true deadlock detection independent of the network load and message lengths, rather than approximating deadlock with congestion estimation, as existing methods do. It was observed that timing-based methods may produce two orders of magnitude more deadlock alarms than the TC-network method. Moreover, the hardware overhead of the TC-network was examined: the implementations presented in this chapter demonstrate that it is insignificant.
References
31. S.R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob,
S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, S. Borkar, An 80-tile sub-100-w
teraflops processor in 65-nm CMOS. IEEE J. Solid State Circuits 43(1), 29–41 (2008)
32. V.I. Varshavsky (Ed.), Self-timed Control of Concurrent Processes: The Design of Aperiodic
Logical Circuits in Computers and Discrete Systems (Kluwer Academic, Norwell, MA, USA,
1990)
33. S. Warnakulasuriya, T. Pinkston, Characterization of deadlocks in interconnection networks, in Proceedings of IPPS '97, Geneva, Switzerland (IEEE Computer Society, 1997), pp. 80–86
Chapter 4
The Abacus Turn Model
B. Fu () • Y. Han • H. Li • X. Li
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
e-mail: fubinzhang@ict.ac.cn; yinhes@ict.ac.cn; lihuawei@ict.ac.cn; lxw@ict.ac.cn
The power, memory and instruction-level parallelism walls are forcing processors to integrate more and more cores. For example, Tilera's TILE64 processor integrates 64 cores [1], while Intel's single-chip cloud computer [2] and terascale processor [3] have 48 and 80 cores, respectively.
To efficiently interconnect such a large number of elements, the Network-on-Chip (NoC) has been widely viewed as the replacement for shared buses and dedicated wires [1, 3, 4]. Generally, an NoC consists of routers and links. Adjacent routers are connected by links according to the network topology, such as a ring (as shown in Fig. 4.1a) or a 2D mesh (as shown in Fig. 4.1b). These kinds of topologies are popular because their planar structures facilitate IC manufacturing. Recent advances in 3D IC technologies have motivated 3D topologies [5], such as 3D meshes. In this chapter, most discussions are based on 2D meshes for simplicity.
The network topology determines the ideal performance of a network, since it determines the network diameter and the degree of path diversity. To achieve this ideal performance, the network resources, such as buffer capacity and channel bandwidth, must be allocated without any waste; thus, an efficient flow control technique is expected. To reduce buffer requirements and packet latency, the wormhole flow control technique has become one of the main paradigms. In wormhole-switched networks, a packet is divided into fixed-size flow control digits (flits), comprising a head flit, several payload flits, and a tail flit. Because a packet can move across several nodes simultaneously, a very short network delay is incurred. However, once the head flit is blocked, all flits must stay along the path. This, together with the fact that application traffic tends to be very bursty, significantly increases the possibility of network blocking.
Once a packet is blocked, its queueing delay increases dramatically and finally becomes the dominant contributor to latency. Increased network latency may degrade application performance or could even cause QoS violations. Traffic load balancing techniques, such as traffic-aware (or congestion-aware) routing algorithms, are promising ways to reduce network blocking [6–11]. Traffic-aware routing algorithms make routing decisions based on the run-time network status, so that packets can be routed evenly along all legal paths. Some paths are illegal because routing packets along them may cause routing deadlock or livelock. Generally, most traffic-aware routing algorithms are fully adaptive, so there is always a routing choice to be made. However, current fully adaptive routing algorithms may require a large number of virtual channels (VCs) [12–14], or assume a conservative flow control technique [15–17].
The VC, once viewed as cheap and abundant, is expensive in the NoC scenario [18]. First, increasing the number of VCs often increases the router latency, because the VC allocation (VA) delay grows with the number of VCs and VA is usually the critical stage of virtual-channel routers [19]. Second, increasing the number of VCs often increases the router area, since buffers are the main contributor to the router area and adding VCs often requires more buffers. Routing algorithms following Duato's theory [15] assume a conservative flow control technique in which a queue never contains flits belonging to different packets [15–17]. This facilitates reducing the number of VCs, but usually degrades the performance of networks carrying many back-to-back short packets [19], e.g., the control packets of cache coherence. To address this problem, the whole-packet-forwarding flow control technique was recently proposed [20]; with this extension, a queue may contain multiple short packets. It should be noted that, even when Duato's theory [15] is adopted, NoCs using state-of-the-art fully adaptive routing algorithms require at least twice as many VCs per physical channel. For example, for a CMP system with a directory-based cache coherence protocol, MSI for the L1 cache and MOSI for the L2 cache, at least five virtual networks are required to avoid deadlock. Therefore, five VCs per physical channel in total are required by NoCs using XY routing, and at least 10 VCs by those using fully adaptive routing.
Unlike fully adaptive routing algorithms, VC-free partially adaptive routing algorithms are more cost-efficient and can be realized with aggressive flow control techniques [21, 22]. Glass and Ni [21] proposed the turn model for designing partially adaptive routing algorithms without VCs. The turn model classifies all possible turns into clockwise and counter-clockwise abstract cycles; in each kind of abstract cycle, one turn is prohibited to avoid deadlock. Based on the turn model, [21] further proposed three partially adaptive routing algorithms, namely west-first, north-last and negative-first. Unfortunately, the degree of adaptiveness of these algorithms is highly uneven. To address this problem, Chiu [22] proposed the odd-even turn model, where NW¹ and SW turns are prohibited in odd columns² and EN and ES turns are prohibited in even columns. Odd-even routing provides more even adaptiveness; however, none of the long-distance (>2 hops) node pairs is provided with full adaptiveness.
Both the insufficient and the uneven routing adaptiveness of state-of-the-art partially adaptive routing algorithms may degrade network performance. Application traffic tends to be very bursty [23], and the location of hot spots varies with time. For example, we show the traffic patterns of FFT during execution cycles 900∼1,000 and 1,500∼1,600 in Fig. 4.2a, b respectively. Figure 4.2a shows that node (3, 1)
¹ A packet takes an NW turn when it changes its direction from north to west [21].
² A column is called an odd (respectively, even) column if its coordinate in dimension x is an odd (respectively, even) number [22].
Fig. 4.2 Traffic variations in FFT with 16 cores, (a) time = 900 ∼ 1,000, (b) time = 1,500 ∼ 1,600
is the hot spot during execution cycles 900∼1,000, and Fig. 4.2b shows that nodes (2, 0) and (2, 3) become the hot spots 500 cycles later. To avoid congestion, we expect the packets heading towards the hot spots to be provided with full adaptiveness. Furthermore, since the positions of hot spots vary with time, the routing algorithm is expected to be able to provide full adaptiveness to all node pairs. Unfortunately, none of the state-of-the-art partially adaptive routing algorithms meets these requirements.
According to the above observations, the expected routing algorithm should be:
1. Time/space-efficient, i.e., require neither VCs nor routing tables;
2. Able to provide full adaptiveness to all node pairs.
It has been proved that it is impossible to design a deadlock-free fully adaptive routing algorithm without VCs for a wormhole-switched mesh network [24]; thus, the routing algorithm should be partially adaptive. Partially adaptive routing algorithms cannot provide full adaptiveness to all node pairs at all times. Therefore, the second requirement is reduced to being 'able to', not 'always', provide full adaptiveness to all node pairs. In other words, the routing algorithm could provide full adaptiveness to some node pairs at some time, and to others when the traffic pattern changes.
To achieve the above objectives, routing algorithms should be reconfigurable. We should distinguish a reconfigurable routing algorithm from a routing reconfiguration algorithm. The former indicates that the routing algorithm can adjust itself to adapt to network variations, such as the detection of faults or hot-spot nodes. Routing reconfiguration algorithms, on the other hand, are proposed to statically [25, 26] or dynamically [27–30] load a new routing algorithm, computed offline, to replace the old one without introducing deadlock. Most state-of-the-art reconfigurable routing algorithms are designed for tolerating faults, such as [31–35]; their common feature is that the reconfiguration is triggered by faults. If we simply view the nodes belonging to congestion regions as (temporarily) faulty, packets could be routed around the congestion regions. However, according to fault-tolerant routing algorithms [31–35], faulty nodes, here those located in congestion regions, are not allowed to send and receive packets; otherwise, the network may deadlock. Yet we cannot forbid sending packets to hot-spot nodes that are prone to be included in congestion regions. Therefore, fault-tolerant reconfigurable routing algorithms cannot be directly used to address the congestion problem. It is also possible to reconfigure the network topology, as in [36, 37]; however, few of these works address the congestion problem.
The above observations encourage us to design a new reconfigurable routing algorithm. The major challenge is to address the deadlock problem: when a hot-spot node is detected, there is no existing rule to follow to generate a new deadlock-free routing algorithm. Generally, deadlock happens when packets wait for each other in a cycle. To prevent deadlock, VC-free routing algorithms should keep the channel dependence graph (CDG) acyclic [38]. Owing to their high complexity, most cycle elimination algorithms, such as [39–42], are off-line solutions. Turn models [21, 22] provide a simple way to keep the CDG acyclic, but they are static and cannot be directly used to generate reconfigurable routing algorithms. To address this problem, the Abacus Turn Model (AbTM), which is discussed in this chapter, was recently proposed in [43].
The turn model is proposed for n-dimensional meshes and k-ary n-cubes. Following the turn model, there are six basic steps to design a deadlock-free routing algorithm.
1. Partition all virtual channels, except the wraparound channels, into sets according to their virtual direction. Note that the virtual channels of a physical channel are divided into separate sets since they have different virtual directions. All wraparound channels are put into one set.
2. Identify all possible turns from one virtual direction to another, ignoring 0° and 180° turns.
3. Identify all abstract cycles, which are the simplest cycles in each plane of the topology.
4. Prohibit one turn in each abstract cycle.
5. Incorporate as many turns as possible from the set of wraparound virtual channels without introducing cycles.
6. Incorporate as many 0° and 180° turns as possible without introducing cycles.
Partitioning the virtual channels into sets and using only the turns allowed by steps 4, 5 and 6 ensures that the virtual channels are accessed in a strictly increasing or decreasing order; thus, the network is deadlock-free. Furthermore, routing algorithms based on the turn model are livelock-free, since a packet will definitely arrive at its destination owing to the finite number of channels.
In the following, we show how the turn model is applied to a 2D mesh. Furthermore, we assume that there are no virtual channels in the network. A 2D mesh
has m × n nodes, where m (resp., n) is the radix of dimension x (resp., y). Each node
d has an address d : (dx , dy ), where dx ∈ {0, 1, 2, . . ., m − 1} and dy ∈ {0, 1, 2, . . . ,
n − 1}. Two nodes d : (dx , dy ) and e : (ex , ey ) are neighbors in dimension x (resp., y)
if and only if |dx − ex | = 1 and dy = ey (resp., |dy − ey | = 1 and dx = ex ). If two nodes
are neighbors in dimension x (resp., y), they are connected by a bidirectional row
(resp., column) channel. Each bidirectional row (resp., column) channel consists of
two physical channels with opposite directions: EW and WE (resp., NS and SN)
channels. Particularly, EW (resp., WE, NS, and SN) channel is used to forward
packets from east to west (resp., west to east, north to south, and south to north).
Each m × n mesh consists of m columns and n rows. Each row (resp., column)
consists of m (resp., n) nodes with the same coordinate in dimension y (resp., x).
A packet moving towards direction X makes an XY turn if it turns to direction
Y , where X,Y ∈ {E,W, N, S} and E (resp., W , N, and S) refers to direction east
(resp., west, north, and south). Of these 16 different combinations, 8 combinations
lead to 90◦ turns. For example, if a packet moving towards east (X = E) turns to
south (Y = S), then an ES turn is introduced (as shown in Fig. 4.3a). Four of these
combinations lead to 0° turns. A 0° turn is possible as long as there is more than one channel in a direction. For example, Fig. 4.3i shows that X and Y are both
east (E). The last four combinations lead to 180◦ turns. A 180◦ turn is introduced if
X and Y are opposite. For example, Fig. 4.3j shows that a packet makes a 180◦ turn
by changing its direction from east to the west.
The eight 90° turns can be divided into two sets, the clockwise and the counter-clockwise turns, according to the rotation direction: ES, SW, WN and NE turns are clockwise turns, while EN, NW, WS and SE turns are counter-clockwise turns. As shown in Fig. 4.4a, the four clockwise turns form the clockwise abstract cycle, and the four counter-clockwise turns form the counter-clockwise abstract cycle.
The turn model avoids deadlock by prohibiting one turn in each abstract cycle. Because there are 4 different turns in each abstract cycle, there are in total 16 different combinations of the 2 prohibited turns. Of these 16 combinations, 4 are illegal. The reason is that making three clockwise turns in succession is equivalent to making one counter-clockwise turn. For example, Fig. 4.5a shows that making the SW, WN and NE turns in succession is equivalent to making a counter-clockwise SE turn. Similarly, making the EN, NW and WS turns in succession is equivalent to making a clockwise ES turn. Therefore, the combination (ES, SE) is illegal, since a cycle can still be formed by the six allowed turns, as shown in Fig. 4.5c. For the same reason, three other combinations are illegal, namely (SW, WS), (WN, NW) and (NE, EN).
Of the 12 legal combinations, only 3 are unique if rotation symmetry is considered. As shown in Fig. 4.6, they are named west-first, negative-first, and north-last, respectively. In 2D meshes there are no wraparound turns, so step 5 can be ignored. In step 6, we incorporate all 0° turns and, for simplicity, forbid all 180° turns.
With the west-first routing algorithm, the SW and NW turns are forbidden in the clockwise and counter-clockwise abstract cycles, respectively. West-first routing can be proved to be deadlock-free by enumerating the channels in such a way that they are accessed in a strictly decreasing order; owing to the limited space, the detailed proof is omitted here. Since packets are not allowed to turn to the west during transmission, packets must first be routed to the west if the destination is
Fig. 4.6 Three unique combinations considering rotation symmetry, (a) west-first routing,
(b) north-last routing, (c) negative-first routing
to the west of the source, i.e., dx < sx. Therefore, although west-first routing is adaptive, there is only one path for northwestward and southwestward packets, as shown in Eq. 4.1, where Swest-first represents the degree of adaptiveness of west-first routing. This is obviously unfair, and congestion caused by northwestward and southwestward packets cannot be avoided by west-first routing.
$$S_{west\text{-}first} = \begin{cases} \dfrac{(x+y)!}{x!\,y!}, & \text{if } d_x \ge s_x \\[4pt] 1, & \text{otherwise} \end{cases} \tag{4.1}$$
North-last routing does not allow packets moving towards the north to make turns, forbidding the NE and NW turns, as shown in Fig. 4.6b. Like west-first routing, north-last routing is not even either, as shown in Eq. 4.2: for southeastward and southwestward packets, north-last is fully adaptive, but for packets with dy > sy, north-last routing is equivalent to XY routing.
$$S_{north\text{-}last} = \begin{cases} \dfrac{(x+y)!}{x!\,y!}, & \text{if } d_y \le s_y \\[4pt] 1, & \text{otherwise} \end{cases} \tag{4.2}$$
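As a quick cross-check of Eqs. 4.1 and 4.2, the path counts can be evaluated directly; in this sketch x and y are taken as the hop distances |dx − sx| and |dy − sy|, an assumption consistent with the equations above.

```python
from math import comb

def s_west_first(sx, sy, dx, dy):
    """Eq. 4.1: number of legal minimal paths under west-first routing."""
    x, y = abs(dx - sx), abs(dy - sy)
    return comb(x + y, x) if dx >= sx else 1  # fully adaptive eastward only

def s_north_last(sx, sy, dx, dy):
    """Eq. 4.2: fully adaptive only for packets heading south (dy <= sy)."""
    x, y = abs(dx - sx), abs(dy - sy)
    return comb(x + y, x) if dy <= sy else 1  # comb(x+y, x) = (x+y)!/(x!y!)
```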
proposed to prohibit the ES turn in even columns and the SW turn in odd columns. To break the counter-clockwise rightmost column segments, the EN and NW turns are forbidden in even and odd columns, respectively.
Routing algorithms based on the odd-even turn model are deadlock-free as long as the 180° turns are prohibited. This can easily be proved by contradiction. Assume that a deadlock happens; then, according to Dally and Seitz's theory [38], there must be channels waiting for each other in a cycle. Since there is no 180° turn, the cycle must contain both row and column channels. In this cycle, there must be a rightmost column line segment, which consists of a sequence of column channels with the same direction. Let nodes S and E be the start and end nodes of this rightmost segment, respectively. As shown in Fig. 4.8a, if these channels are NS channels, an ES turn and an SW turn must be made at the start and end nodes, respectively. If the rightmost segment lies in an odd column, then the SW turn at the end node is prohibited, and a contradiction arises. Otherwise, if the column is an even column, then the ES turn at the start node is prohibited, and a contradiction also arises. If the channels are SN channels, the proof is similar. The conclusion is that neither clockwise nor counter-clockwise rightmost column segments can be formed under the odd-even turn model; thus, channel dependence cycles never form in the network and the network is deadlock-free. This conclusion is important, because it is also the foundation of the AbTM.
The major advantage of the odd-even turn model over the original turn model is that routing algorithms following the odd-even turn model provide more even routing adaptiveness to different source-destination node pairs. For simplicity, we name the minimal routing algorithm following the odd-even turn model odd-even routing. Calculating the adaptiveness of odd-even routing is more complicated than that of west-first, north-last and negative-first routing. Let us define x = dx − sx and y = dy − sy, where (sx, sy) and (dx, dy) are the coordinates of the source and destination nodes, respectively. A packet is called an NE (resp., SE, NW, and SW) packet if x > 0 and y ≥ 0 (resp., x > 0 and y < 0, x ≤ 0 and y ≥ 0, and x ≤ 0 and y < 0). The even column, which allows the NW and SW turns, is an allowable column for NW and SW packets. The odd column, which allows the EN and ES turns, is an allowable column for NE and SE packets. Let h = ⌈|x|/2⌉ and h′ = ⌈(|x| − 1)/2⌉; the adaptiveness of odd-even routing can then be calculated as Eq. 4.4 for NE and SE packets, and as Eq. 4.5 for NW and SW packets.
$$S_{odd\text{-}even\text{-}(NE+SE)} = \begin{cases} \dfrac{(|y| + h')!}{|y|!\,h'!}, & \text{if column } s_x \text{ is an allowable column and } |x| \text{ is an odd number} \\[6pt] \dfrac{(|y| + h)!}{|y|!\,h!}, & \text{otherwise} \end{cases} \tag{4.4}$$

$$S_{odd\text{-}even\text{-}(NW+SW)} = \begin{cases} \dfrac{(|y| + h)!}{|y|!\,h!}, & \text{if column } s_x \text{ is an allowable column or } |x| = 0 \\[6pt] \dfrac{(|y| + h')!}{|y|!\,h'!}, & \text{otherwise} \end{cases} \tag{4.5}$$
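The case analysis of Eqs. 4.4 and 4.5 can be transcribed directly; this is our own sketch using the definitions of h and h′ given above, and the function name is hypothetical.

```python
from math import ceil, comb

def s_odd_even(sx, sy, dx, dy):
    """Eqs. 4.4/4.5: minimal-path count under odd-even routing,
    with h = ceil(|x|/2) and h' = ceil((|x|-1)/2) as defined above."""
    x, y = dx - sx, dy - sy
    h = ceil(abs(x) / 2)
    hp = ceil((abs(x) - 1) / 2) if abs(x) > 0 else 0
    if x > 0:                        # NE/SE packet: odd columns are allowable
        k = hp if (sx % 2 == 1 and abs(x) % 2 == 1) else h
    else:                            # NW/SW packet: even columns are allowable
        k = h if (sx % 2 == 0 or x == 0) else hp
    return comb(abs(y) + k, k)       # (|y|+k)!/(|y|! k!)
```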
Comparing Eqs. 4.4 and 4.5 with Eqs. 4.1–4.3, we find that the adaptiveness of odd-even routing is much more even than that of west-first, north-last and negative-first routing. This benefit stems from the fact that different turns are prohibited in odd and even columns. However, Eqs. 4.4 and 4.5 also show that odd-even routing does not provide full adaptiveness for any source-destination node pair whose distance is larger than 2. Without full adaptiveness, the ability to reduce the chances that packets are blocked degrades considerably. The major cause of the limitations of the turn model and the odd-even turn model is that routers prohibit certain kinds of turns statically; for example, all routers prohibit the NW and SW turns under west-first routing. In fact, turns are also resources: allocated more turns, a packet has more chances to route around congestion. The turn model and the odd-even turn model allocate the turn resources statically, regardless of the network status. In the next section, the AbTM, which can dynamically allocate turn resources, is discussed.
The AbTM inherits the basic idea of [22] that “a network is deadlock-free if
both clockwise and counter-clockwise rightmost columns are removed from the
network”, but it uses a more flexible way to realize it. To form a clockwise rightmost
column, as shown in Fig. 4.9a, the ES turn should be above an SW turn. Thus, there
are three ways to avoid the formation of the rightmost column. The first way is
to prohibit the SW turn at all routers as shown in Fig. 4.9b. The second way is to
prohibit the ES turn at all routers as shown in Fig. 4.9c. Third, we could also remove the clockwise rightmost column by forbidding all ES turns above any SW turn, as shown in Fig. 4.9d. Thus, in each column there is a node above which all ES turns are prohibited and below which all SW turns are prohibited. In this way, there is no ES turn above any SW turn in any column, so no clockwise rightmost column can be formed. We call this kind of node a clockwise bead. Actually, the first two ways are special cases of the third one. For example, the first way can be realized by moving the bead to the top of the column. Note that the top node also allows the SW turn, but this turn will never be used. Similarly, in each column there is also a counter-clockwise bead, above which the NW turn is prohibited and below which the EN turn is prohibited. Thus, the counter-clockwise rightmost column is removed from the network too.
A 4 × 4 mesh is compared to an abacus as shown in Fig. 4.10a. Each col-
umn is viewed as a wire with two sliding beads, i.e., the clockwise bead and
counter-clockwise bead. Clockwise (resp., counter-clockwise) beads, which are
separately controlled in different columns, are utilized to regulate the distribution
of clockwise turns (resp., counter-clockwise turns). Generally, AbTM rules can be
summarized as:
1. Nodes above clockwise (resp., counter-clockwise) bead prohibit ES (resp., NW)
turn,
2. Nodes below the clockwise (resp., counter-clockwise) bead prohibit the SW (resp., EN) turn.
For example, the east-last routing can be realized by moving the clockwise and counter-clockwise beads to the bottom and the top in each column, respectively.
respectively. After a 90◦ rotation in the counter-clockwise direction, the east-last
routing becomes the north-last routing.
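For illustration, the per-column permission check implied by the two rules can be sketched as follows, assuming row indices grow northward and that the bead holder itself permits both turns of its group (both assumptions are ours):

```python
def turn_allowed(turn, row, cw_bead_row, ccw_bead_row):
    """AbTM turn permission at a router in a given row of one column."""
    if turn in ("WN", "NE", "WS", "SE"):
        return True                     # non-critical, always enabled
    if turn == "ES":                    # prohibited above the cw bead
        return row <= cw_bead_row
    if turn == "SW":                    # prohibited below the cw bead
        return row >= cw_bead_row
    if turn == "NW":                    # prohibited above the ccw bead
        return row <= ccw_bead_row
    if turn == "EN":                    # prohibited below the ccw bead
        return row >= ccw_bead_row
    raise ValueError("180-degree turns are never allowed")

# West-first as a special case on a 4 x 4 mesh: the clockwise bead at
# the top row disables SW below it; the counter-clockwise bead at the
# bottom row disables NW above it.
TOP, BOTTOM = 3, 0
assert not turn_allowed("SW", 1, cw_bead_row=TOP, ccw_bead_row=BOTTOM)
assert not turn_allowed("NW", 2, cw_bead_row=TOP, ccw_bead_row=BOTTOM)
assert turn_allowed("ES", 2, cw_bead_row=TOP, ccw_bead_row=BOTTOM)
```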
By now, designing a deadlock-free routing algorithm is reduced to assigning the
positions of clockwise and counter-clockwise beads inside each column. As long
as the positions are determined, the network is deadlock-free. We prove this as the
following theorem.
Theorem 1. A network following the abacus turn model is deadlock-free.
Proof. We prove this theorem by contradiction. Assume a network following the abacus turn model is deadlocked; then there must be a turn dependence cycle in the network according to [21]. According to [22], there must be a clockwise or counter-clockwise rightmost column in this cycle. Without loss of generality, we assume it is a clockwise rightmost column. Therefore, there must be a node, Nx, allowing the ES turn above a node, Ny, allowing the SW turn. According to the abacus turn model, Nx must be the clockwise bead or below it, and Ny must be the clockwise bead or above it. Therefore, Nx cannot be above Ny. Contradiction arises.
Compared with the turn model [21] and the odd-even turn model [22], the major advantage of AbTM is that it is dynamically reconfigurable. For a k × k mesh, there are two beads in each column and k potential positions for each bead, thus there are k × k combinations for each column and (k × k)^k combinations in total for a network. That means (k × k)^k different deadlock-free routing algorithms can be realized with AbTM in a k × k mesh. For example, for a 4 × 4 mesh, there are 65,536 different combinations. Once the beads are moved, the routing algorithm is reconfigured. Therefore, we can move the beads in each column up or down to optimize the network performance under different network statuses. The following example shows how AbTM-based reconfigurable routing tackles the congestion problem.
Example 1. As shown in Fig. 4.14a, clockwise beads are initially located on the bottom row; thus, routers above the beads prohibit the ES turn. Meanwhile, a new hot spot (node-5) is detected, and node-6 is expected to send packets to it over a relatively long period. Because there is only one available minimal path between them, this path is highly prone to congestion. To balance traffic, node-6 wants to have more available paths. Thus, it makes a complaint to node-7 about the prohibited ES turn each time it tries to send packets. Meanwhile, node-7 collects the complaints and negotiates with the bead holder, i.e., node-1. The holder evaluates the requirements and determines whether to give up the bead. Here, node-1 passes the bead up as shown in Fig. 4.14b. Receiving the bead, node-7 can enable the ES turn according to the AbTM. Thus, node-6 can exploit two available paths to balance the traffic. Similarly, node-7 will make complaints to node-8. Finally, node-8 also gets the ownership of the clockwise bead as shown in Fig. 4.14c. By now, all minimal paths can be used to forward packets from node-6 to node-5 to reduce the congestion.
Following the AbTM, the network is always deadlock-free no matter where the clockwise and counter-clockwise beads are located in the columns. Thus, for a reconfigurable routing algorithm, the main job is to decide the distribution of beads in each column so as to optimize the network performance. In other words, the reconfigurable routing algorithm should determine the time and direction of each bead movement. According to the AbTM, four turns, WN, NE, WS and SE, are always enabled since they are not critical. The other four turns, ES, SW, EN, and NW, are updated with the bead movement. These four turns can be further classified into two groups, i.e., the clockwise group (ES and SW) and the counter-clockwise group (EN and NW). Generally, moving a bead means disabling a turn at the old bead holder and enabling the other turn of the same group at the new holder. Thus, the turn requirement is a natural metric to determine the direction of bead movement.
With the requirements of each turn, moving a bead can be compared to moving a massless block, as shown in Fig. 4.15. For example, the north neighbor gives a force, Fup, which reflects its eagerness for the ES turn, to pull up the clockwise bead. On the other hand, the south neighbor expresses its eagerness for the SW turn by giving a force, Fdown. The current holder gives a friction, f, to resist the motion. The direction of f is always opposite to the direction of bead movement. For example, if Fup > Fdown, the bead is going to be moved up; thus f reflects the holder's eagerness for the SW turn, resisting the movement. Otherwise, it reflects the holder's eagerness for the ES turn, resisting moving the bead down.
Furthermore, a threshold (Th) is set to prevent "shaking", i.e., a bead being frequently moved up and down. Therefore, the force summation must be larger than Th to move the bead. For moving counter-clockwise beads, Fup reflects the north neighbor's eagerness for the NW turn, and Fdown reflects the south neighbor's eagerness for the EN turn.
Mathematically, the forces can be described as Eqs. 4.6–4.9, where CTxy and CTxy−z are counters recording the number of times the xy turn was required at the bead holder and at the z neighbor, respectively. For example, the upward force on the clockwise bead is CTes−n, as shown in Eq. 4.6, where CTes−n counts the times the ES turn was required at the north (n) neighbor. As shown in Eq. 4.7, the downward force on the clockwise bead is CTsw−s, which represents the times the SW turn was required at the south neighbor. Each router maintains four counters, i.e., CTes, CTsw, CTen and CTnw, one for each critical turn. These counters are updated whenever the corresponding turn is required. We say a turn is required if a packet makes this turn or a complaint about this turn is received. As discussed in the above example, a router makes a complaint to its neighbor once it detects that the neighbor prohibits a turn it needs. Without the "complaint" mechanism, the counters would only record how many times a turn was used instead of how many times it was required. Thus, the "complaint" mechanism is important for AbTM-based reconfigurable routing algorithms. In [43], complaints are only transferred between neighbors. If a router could accurately make complaints to all routers that will be used by the packets being processed, the reconfigurable routing would be even more powerful.
$$
F_{up}=\begin{cases}CT_{es\text{-}n}, & \text{clockwise;}\\ CT_{nw\text{-}n}, & \text{counter-clockwise.}\end{cases}\tag{4.6}
$$

$$
F_{down}=\begin{cases}CT_{sw\text{-}s}, & \text{clockwise;}\\ CT_{en\text{-}s}, & \text{counter-clockwise.}\end{cases}\tag{4.7}
$$
$$
f=\begin{cases}CT_{sw}, & \text{clockwise and } F_{up} > F_{down};\\ CT_{es}, & \text{clockwise and } F_{up} < F_{down};\\ CT_{en}, & \text{counter-clockwise and } F_{up} > F_{down};\\ CT_{nw}, & \text{counter-clockwise and } F_{up} < F_{down}.\end{cases}\tag{4.8}
$$
$$
F_{summation}=\begin{cases}CT_{es\text{-}n}-CT_{sw\text{-}s}-CT_{sw}, & \text{clockwise and } CT_{es\text{-}n} > CT_{sw\text{-}s};\\ CT_{sw\text{-}s}-CT_{es\text{-}n}-CT_{es}, & \text{clockwise and } CT_{es\text{-}n} < CT_{sw\text{-}s};\\ CT_{nw\text{-}n}-CT_{en\text{-}s}-CT_{en}, & \text{counter-clockwise and } CT_{nw\text{-}n} > CT_{en\text{-}s};\\ CT_{en\text{-}s}-CT_{nw\text{-}n}-CT_{nw}, & \text{counter-clockwise and } CT_{nw\text{-}n} < CT_{en\text{-}s};\\ 0, & \text{otherwise.}\end{cases}\tag{4.9}
$$
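As a sketch, the arm-wrestling decision for a clockwise bead holder could be written as follows; the counter names mirror Eqs. 4.6–4.9, while the threshold value and the function interface are our assumptions:

```python
def arm_wrestling_cw(ct_es_n, ct_sw_s, ct_sw, ct_es, threshold=8):
    """Movement decision for a clockwise bead holder (Eqs. 4.6-4.9).

    ct_es_n: ES requirement counted at the north neighbor (F_up)
    ct_sw_s: SW requirement counted at the south neighbor (F_down)
    ct_sw, ct_es: the holder's own counters, acting as friction f
    """
    if ct_es_n > ct_sw_s:                  # net upward pull
        return "up" if ct_es_n - ct_sw_s - ct_sw > threshold else "stay"
    if ct_sw_s > ct_es_n:                  # net downward pull
        return "down" if ct_sw_s - ct_es_n - ct_es > threshold else "stay"
    return "stay"                          # balanced: F_summation = 0

# Strong ES demand above, little SW demand below: the bead moves up.
assert arm_wrestling_cw(ct_es_n=20, ct_sw_s=3, ct_sw=5, ct_es=1) == "up"
```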
The arm-wrestling algorithm is a local mechanism that only considers the turn requirements of the current holder and its direct north and south neighbors. The local mechanism simplifies the implementation, but the requirements of farther neighbors may not be fulfilled in time. To address this problem, the tug-war algorithm, which differs from arm-wrestling in the definition of Fup and Fdown, was proposed.
Basically, tug-war is a non-local mechanism. In the tug-war algorithm, the nodes above the bead are viewed as one group, and their total requirement for the corresponding turn defines Fup. As shown in Eq. 4.10, the total requirement for the ES (resp., NW) turn over all north neighbors is viewed as the Fup on the clockwise (resp., counter-clockwise) bead. On the other hand, the nodes below the bead are viewed as another group, and their total requirement for the turn defines Fdown. Equation 4.11 shows that the Fdown on the clockwise (resp., counter-clockwise) bead is the total requirement for the SW (resp., EN) turn over all south neighbors. To obtain the total requirement, each router adds its local requirement to the requirement received from upstream nodes and propagates the result to the downstream node through dedicated out-of-band wires. As shown in Eq. 4.12, to emphasize the requirements of nodes closer to the bead, the accumulated requirement is divided by 2 at each node.
$$
F_{up}=\begin{cases}T_{es\text{-}n}, & \text{clockwise;}\\ T_{nw\text{-}n}, & \text{counter-clockwise.}\end{cases}\tag{4.10}
$$

$$
F_{down}=\begin{cases}T_{sw\text{-}s}, & \text{clockwise;}\\ T_{en\text{-}s}, & \text{counter-clockwise.}\end{cases}\tag{4.11}
$$
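The halving rule of Eq. 4.12 (whose full form is not reproduced here) can be sketched as a simple fold over one column; the exact order of the halving and the addition is our reading of the text:

```python
def tug_war_force(requirements):
    """Aggregate one group's turn requirements toward the bead
    (Eqs. 4.10-4.11); the running sum is halved at each node so that
    nodes closer to the bead weigh more (the rule of Eq. 4.12).

    `requirements` is ordered from the farthest node to the node
    adjacent to the bead."""
    total = 0
    for r in requirements:
        total = total // 2 + r
    return total

# The node next to the bead (8) dominates the farther ones (2, 4):
assert tug_war_force([2, 4, 8]) == 10   # ((2//2 + 4)//2 + 8)
```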
Moving beads is a hard job due to the difficulty of keeping the network deadlock-free. Taking Fig. 4.16a as an example, node-1 holds the clockwise bead. Thus, it allows the SW turn, and node-4 can utilize it to forward packets to node-0. As shown in Fig. 4.16b, node-1 will prohibit the SW turn after the movement of the bead. If node-4 does not notice that change in time, the packets sent to node-0 could be blocked at node-1. Furthermore, if node-7 enables the ES turn while the SW turns to be disabled at node-1 and node-4 are still held by packets, a clockwise rightmost column may be formed. Thus, the AbTM is violated.
To solve these problems, bead movement should follow two rules:
1. A turn can be disabled iff no packet requires it,
2. A turn can be enabled iff there is no packet holding the other turn which is needed
to form a rightmost column.
In general, h turns should be prohibited and h other turns enabled after an h-hop movement of a bead. For example, to move the bead from node-1 to node-7, the SW turns at node-1 and node-4 should be prohibited, and the ES turns at node-4 and node-7 should be enabled. Thus, it is hard to meet the above two rules simultaneously, since a large number of turns is involved, especially in a large-scale network. To reduce the complexity of bead movement, it is divided into h steps, within each of which the bead is moved up/down just one hop. This basic step is named bead passing and is viewed as a safe operation. Exploiting this safe operation has two advantages. First, bead passing has good scalability, since it can be realized locally by the cooperation of adjacent routers. Second, bead passing itself guarantees that the network remains deadlock-free during the reconfiguration. Therefore, routing designers do not need to consider the deadlock problem when designing their own reconfigurable routing algorithms.
Within each step, only one turn should be prohibited at the original bead holder. For example, to move the bead from node-1 to node-4, node-1's SW turn should be prohibited. Before disabling the turn, node-1 should notify node-4 to stop injecting southwest packets,3 which may require the SW turn at node-1. On receiving the notification, node-4 labels its south output port as "southwest unaccepted". After that, all southwest packets will be routed to node-3, since the WS turn is always allowed. However, node-4 cannot send the acknowledgement to node-1 right away, because there may be southwest packets that were already routed based on the old information. Thus, node-4 should drain such packets before sending the acknowledgement.
3 Southwest packets are those whose destination is to the southwest of the current node.
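A rough sketch of the one-hop clockwise bead passing just described (moving the bead up from node-1 to node-4); all method names are hypothetical, and the real protocol of Fig. 4.17a is implemented in router hardware [43]:

```python
def pass_cw_bead_up(holder, receiver):
    """One-hop clockwise bead passing, e.g., from node-1 (holder)
    to node-4 (receiver) directly above it. A sketch only."""
    # 1. The old holder asks the receiver to stop injecting southwest
    #    packets that would need the holder's soon-disabled SW turn.
    receiver.mark_south_port("southwest unaccepted")
    # 2. The receiver drains in-flight southwest packets routed under
    #    the old configuration (new ones detour via the always-allowed
    #    WS turn), then acknowledges.
    receiver.drain_southwest_packets()
    receiver.send_ack(holder)
    # 3. Only now is it safe to disable SW at the old holder and
    #    enable ES at the new holder: no packet still needs the turn.
    holder.disable_turn("SW")
    receiver.enable_turn("ES")
```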
Fig. 4.17 The pseudo code of bead passing: (a) clockwise bead passing, (b) counter-clockwise bead passing
Generally, there are two ways to implement routing algorithms, i.e., logic-based and table-based. Logic-based routing algorithms, such as xy routing, are time/space efficient; however, they lack flexibility for reconfiguration. The table-based solution is flexible, but it may introduce large area and timing overhead. Recently, a logic-based distributed routing (LBDR) was proposed [18]. LBDR provides enough flexibility to implement AbTM-based reconfigurable routing algorithms.
Fig. 4.18 The routing logic of (a) the original LBDR and (b) LBDRe
As shown in Fig. 4.18a, LBDR provides an effective way to implement distributed routing by utilizing eight routing bits (Ren, Res, Rse, Rsw, Rwn, Rws, Rne, and Rnw) and four connectivity bits (Ce, Cs, Cw, and Cn). The routing bits represent whether the corresponding neighbor accepts a given kind of packet. For example, Ren = 1 means that the east neighbor accepts northeast packets by allowing the EN turn. The connectivity bits indicate whether the neighbor is connected.
The LBDR routing is separated into two steps. The first step is carried out by the CMP module, which compares the coordinates of the current router and the destination router, i.e., Cx, Cy, Dx, and Dy as shown in Fig. 4.18a. The results point out the directions which can take the packet closer to the destination. For example, E = 1 indicates that the destination is to the east of the current router. The second step checks the routing and connectivity bits and determines whether the corresponding output ports can be taken. For instance, the east output can be used if the east neighbor is connected, i.e., Ce = 1, and one of the following three conditions holds.
1. The destination is on the east of the current router, i.e., E · N̄ · S̄ = 1,
2. The destination is on the northeast and the east neighbor allows the EN turn, i.e.,
E · N · Ren = 1,
3. The destination is on the southeast and the east neighbor allows the ES turn, i.e.,
E · S · Res = 1.
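Those three conditions translate directly into a bit-level check; a sketch, with boolean arguments mirroring the signal names of Fig. 4.18a:

```python
def lbdr_east_ok(E, N, S, Ce, Ren, Res):
    """LBDR second-step check for the east output port. E, N, S come
    from the CMP module; Ce, Ren, Res are connectivity/routing bits."""
    return Ce and ((E and not N and not S)    # destination due east
                   or (E and N and Ren)       # northeast via EN turn
                   or (E and S and Res))      # southeast via ES turn

# The east neighbor forbids the EN turn, so a northeast packet
# cannot take the east port here:
assert not lbdr_east_ok(E=True, N=True, S=False, Ce=True, Ren=False, Res=True)
```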
The LBDR routing may reduce the adaptivity of the implemented routing algorithms due to its local visibility. To tackle this problem, the authors proposed LBDRe, which has two-hop visibility. The LBDRe routing is separated into three steps. The first step is still carried out by the CMP module. The results indicate the directions in which the packets should be routed to reach the destination, as well as whether the destination is two or more hops away along that direction. For example, as shown in Fig. 4.18b, if the destination is to the east, then E = 1; if the destination is also at least two hops away, then E2 = 1.
The second step checks whether the turn made at the current router is allowable. To this end, each router adds eight routing restriction bits, i.e., the RRxy,4 and must remember the packet's input channel, represented by ipl (local), ipe (east), ipw (west), ips (south), and ipn (north). Furthermore, packets from the local input are allowed to be routed through any output port, and packets making a 0°-turn, such as those from the west input to the east output, are always allowed.
The third step checks the routing restrictions of neighbors. Besides the direct neighbors, LBDRe also checks the restrictions of two-hop neighbors. Thus, as shown in Fig. 4.18b, LBDRe requires two more routing bits per output port. The added routing bits, labeled R2xy, indicate whether the packet can take an xy turn at the router two hops away from the current router along the x direction. For example, R2en = 1 means that the router two hops away in the east direction allows the EN turn. To sum up, an output port, such as the east (E) port, is selected at this step if one of the following conditions holds:
1. The destination is on the east of the current router, i.e., E · N̄ · S̄ = 1,
2. The destination is on the northeast and the east neighbor allows the EN turn, i.e.,
E · N · Ren = 1,
3. The destination is on the southeast and the east neighbor allows the ES turn, i.e.,
E · S · Res = 1,
4. The destination is on the northeast, at least two hops away along the east direction, and the two-hop east neighbor allows the EN turn, i.e., E2 · N · R2en = 1,
5. The destination is on the southeast, at least two hops away along the east direction, and the two-hop east neighbor allows the ES turn, i.e., E2 · S · R2es = 1.
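Extending the earlier LBDR sketch with the two-hop signals gives the corresponding LBDRe check for the east port:

```python
def lbdre_east_ok(E, E2, N, S, Ce, Ren, Res, R2en, R2es):
    """LBDRe third-step check for the east output port; E2, R2en and
    R2es add the two-hop visibility described above."""
    return Ce and ((E and not N and not S)
                   or (E and N and Ren)
                   or (E and S and Res)
                   or (E2 and N and R2en)     # EN turn two hops east
                   or (E2 and S and R2es))    # ES turn two hops east
```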
It is worth noting that LBDR can be applied not only to meshes but also to irregular topologies derived from a mesh, as long as any node can communicate with the others through at least one minimal path defined in the original mesh. Since AbTM assumes that the WN, NE, WS, and SE turns are always enabled, we can remove four routing bits: Rse, Rwn, Rws, and Rne. Furthermore, a reconfiguration module realizing the bead-passing-based arm-wrestling or tug-war algorithm should be added. The reconfiguration module updates the remaining four routing bits based on the runtime network status according to the arm-wrestling or tug-war algorithm. Because the reconfiguration module does not add delay to the critical path, the time efficiency of LBDR is inherited.
Lemma 1. AbTM-based routing provides at least one minimal path for any node pair at any time.
Proof. Consider first, as shown in Fig. 4.19, a node pair whose minimal path requires the ES turn. Any
intermediate node could send the packet to its south neighbor because the SE turn is always enabled. Because any two nodes are connected initially, the ES turn at node-5 should be enabled. Hence, the clockwise bead is at or above node-5. Without loss of generality, we assume node-5 holds the clockwise bead. To prohibit the ES turn at node-5, the bead would have to be pulled down. To pull down the bead, the SW turn requirement of the nodes below node-5 would have to be larger than the ES turn requirement of the nodes above node-5, according to arm-wrestling and tug-war. However, the SW requirement of the nodes below node-5 is always 0, due to the absence of west neighbors. Therefore, the clockwise bead will never move below node-5, and the ES turn at node-5 is always enabled.
The proofs for node pairs requiring the SW, EN, and NW turns are similar and omitted. We should note that, in Fig. 4.19c, node-4 does not have a west neighbor, so the SW turn at node-1 cannot be prohibited. In this case, the ES turn requirement of node-7 is also 0, since it does not have a west neighbor either. Otherwise, if node-6 existed, the network topology would be a horizontally-reversed "C", which is not supported by LBDR. In Fig. 4.19d, node-7 does not have a west neighbor for the same reason.
Theorem 3. AbTM-based routing does not drop or suspend packets during the reconfiguration.
Proof. Generally, a packet is dropped by an intermediate node if there is no route for that packet. According to Lemma 1, AbTM-based routing provides at least one minimal path for any node pair at any time. Therefore, packet dropping cannot happen.
Also according to Lemma 1, there is always a minimal path between any two nodes that is not changed by the reconfiguration. Therefore, any packet can proceed along this unchanged minimal path without suspension during the reconfiguration.
4.5 Evaluation
4.5.1 Methodology
AbTM provides a safe way to dynamically tune the routing algorithm, where "safe" means that both the reconfiguration and the reconfigured routing algorithm are deadlock-free. Therefore, we select four baseline routing algorithms that were also proposed to address the "safe" problem: a deterministic routing algorithm, xy routing; two partially adaptive routing algorithms, west-first [21] and odd-even routing [22]; and a minimal fully adaptive routing [15]. Recently proposed routing algorithms, such as CQR [45], O1Turn [46], and RCA [10], aim to improve load balancing (i.e., to address the "port selection" problem); AbTM is orthogonal to them.
With 2D meshes, adaptive routing algorithms may produce two candidate output ports at each hop. In such cases, the port with more credits is selected. All routers, except [15], assume one VC per virtual network, and each VC is assigned a four-entry queue. For [15], two VCs per virtual network are provided to avoid routing deadlock; for a fair comparison, its input buffer queues are equipped with two entries each. Furthermore, [15] does not reallocate a VC until the tail flit leaves, whereas the other routing algorithms can reallocate a VC as soon as the tail flit is received.
We first evaluate the routing algorithms using synthetic traffic patterns, including uniform, transpose and hotspot, in a 4 × 4 mesh. To show the scalability, an 8 × 8 mesh is simulated under the same traffic patterns.
Fig. 4.20 Average packet latency under uniform traffic patterns, (a) 4 × 4 mesh, (b) 8 × 8 mesh
Under the uniform traffic pattern, each node sends packets to the others with the same probability. Figure 4.20a, b show the simulation results for the 4 × 4 and 8 × 8 meshes, respectively. The horizontal axis represents the injection rate in flits per node per cycle, and the vertical axis indicates the average packet latency in router cycles. It has been shown that deterministic routing algorithms can achieve better performance under the uniform traffic pattern than adaptive routing algorithms [21, 22]. The main reason is that adaptive routing algorithms usually make selections based on local information, such as the number of credits of each output port. As shown in Fig. 4.20a, the xy routing algorithm gets the best results, as expected. Following xy routing, west-first and odd-even get similar results. The AbTM-based routing algorithms are worse than the partially adaptive routing algorithms, mainly because the reconfiguration is based on traffic history. Unfortunately, under the uniform traffic pattern, such reconfiguration will be wrong most of the time. For example, assume that the northeast direction carries the highest traffic load in the current interval, so that AbTM-based routing allocates more adaptivity to that direction. By the definition of uniform traffic, however, the northeast direction is quite likely to carry the lowest traffic load in the next interval. The minimal fully adaptive routing algorithm [15], labeled "min-adapt", gets the worst results due to its conservative flow control strategy, in which a VC cannot be reallocated until it is empty. Figure 4.20b shows the results for the 8 × 8 mesh. Most of the results are in accordance with those shown in Fig. 4.20a. However, the relative performance of the min-adapt routing algorithm improves in this simulation. The reason is that the contention probability increases with the network size. When contention occurs, [15] requires packets to wait on the escape output port, i.e., the one chosen by xy routing. Thus, the min-adapt routing algorithm can exploit the long-term evenness of the uniform traffic pattern just like xy routing.
Fig. 4.21 Average packet latency under transpose traffic patterns, (a) 4 × 4 mesh, (b) 8 × 8 mesh
Real-world applications usually generate nonuniform traffic patterns, such as transpose. Under the transpose traffic pattern, a source node s always sends packets to the destination d, where d_i = s_{(i+b/2) mod b} and b is the number of bits used to index nodes. As shown in Fig. 4.21a, the AbTM-based routing algorithms get the best results, because they can provide full adaptiveness to all packets. On the other hand, xy routing gets the worst performance, since it cannot address the congestion problem. West-first gets better results than xy routing by providing full adaptiveness to eastward packets. However, the improvement is limited, since westward packets
still face the congestion problem. Odd-even routing provides more even adaptiveness to packets heading in different directions, and thus gets better results than west-first and xy routing. However, it cannot reach the performance of the AbTM-based routing algorithms, due to its insufficient routing adaptiveness. The min-adapt routing algorithm can also provide full adaptiveness, but its conservative flow control technique aggravates the congestion problem. Figure 4.21b shows the simulation results in an 8 × 8 mesh. The relative performance is the same, but the improvement achieved by the proposed algorithms is larger.
Fig. 4.22 Average packet latency under hotspot traffic patterns, (a) 4 × 4 mesh, (b) 8 × 8 mesh
Applications' bursty traffic may cause hotspots and aggravate the congestion problem. In the following two simulations, we assume four hot-spot nodes: n0, n4, n8, n12. Each hot-spot node receives 20 % more traffic than the others. These four nodes are selected to simulate the situation in which four memory controllers are frequently accessed. In this case, westward packets are prone to congestion. As shown in Fig. 4.22a, the AbTM-based routing algorithms get the best results, since they can provide full adaptiveness by allowing the NW and SW turns at each router. The min-adapt routing algorithm can also provide full adaptiveness, but its conservative flow control technique leads to a low utilization of the input buffers; thus, its performance is worse than that of odd-even routing. Since the west-first and xy routing algorithms cannot provide adaptiveness to avoid the congestion, they get the worst performance. When the network is enlarged, the congestion problem grows as well. By successfully reducing blocking, the AbTM-based routing algorithms again get the best performance, with an enlarged gap.
Bursts dominate a real application's traffic [51]. If we view a burst as hotspot traffic, then an application consists of hotspot traffic in consecutive intervals with different hot-spot nodes. If AbTM-based routing can reconfigure itself in time, we can expect an application performance improvement. The following trace-driven simulations are carried out to validate this assumption.
Figure 4.23a shows the packet latency, normalized to the latency of xy routing, across the Splash-2 benchmarks. By dynamically allocating more adaptivity to bursty packets, the AbTM-based routing can significantly reduce the packet latency. Generally, both the arm-wrestling and tug-war routing algorithms reduce packet latency for all applications. However, due to the applications' different characteristics, the improvement differs. For example, for applications generating highly contended traffic, such as fft and water-spatial, the improvement is significant. For applications generating lightly contended traffic, such as raytrace and ocean, however, the improvement is smaller.
Fig. 4.24 Area overhead comparison, (a) router area evaluation, (b) tile area evaluation
4.5.2.3 Overhead
The router frequency is one of the most important performance metrics for evaluating NoCs; to achieve high network performance, it should be as high as possible. Although the AbTM-based routing algorithms dynamically update the routing bits of LBDR, they neither modify its routing logic nor add delay to the critical path. This, together with the fact that LBDR has been proved to be timing efficient [18], shows that the AbTM-based routing algorithms are also timing efficient.
Another kind of overhead is the area overhead, which determines the chip cost. Figure 4.24a shows the router area normalized to the area of the xy router. According to the results, the router realizing the min-adapt routing algorithm is the largest, because it requires five VC allocators. The other routers do not have VC allocators, since they do not implement virtual channels. The tug-war router is the second largest, because it requires more logic to propagate traffic information. Due to their low complexity, the west-first and odd-even routers are smaller than the arm-wrestling router. Compared with the xy router, the arm-wrestling and tug-war routers increase the area by 5 and 8 %, respectively. Taking the area of the core and caches into consideration, the area increase is negligible. As shown in Fig. 4.24b, the increase in tile area will be less than 1 %, assuming the router occupies 11 % of a tile's area, as reported by Intel [3].
Assuming the same VLSI technology and working frequency, the dynamic power of a router is largely determined by its size and activity. As discussed above, the routers do not differ significantly in area; thus, they consume similar amounts of dynamic power, as their activities do not differ significantly either. Compared with normal routing activities, reconfigurations consume negligible power, since they rarely happen. If we further take the power of the cores and caches into consideration, the power overhead is negligible.
4.6 Summary
References
2. http://www.intel.com/content/www/us/en/research/intel-labs-single-chip-cloudcomputer.html
3. S.R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob,
S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, S. Borkar, An 80-tile sub-100-w
teraflops processor in 65-nm CMOS. IEEE J. Solid-State Circuits 43(1), 29–41 (2008)
4. W.J. Dally, B. Towles, Principles and Practices of Interconnection Networks (Morgan Kauf-
mann, San Francisco, 2004)
5. J. Kim, C. Nicopoulos, D. Park, R. Das, Y. Xie, N. Vijaykrishnan, M.S. Yousif, C.R. Das,
A novel dimensionally-decomposed router for on-chip communication in 3D architectures, in
Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA’07,
San Diego (ACM, New York, 2007), pp. 138–149
6. J. Kim, D. Park, T. Theocharides, N. Vijaykrishnan, C.R. Das, A low latency router supporting
adaptivity for on-chip interconnects, in Proceedings of Design Automation Conference, San
Diego, 2005, pp. 559–564
7. A. Singh, W.J. Dally, A.K. Gupta, B. Towles, Goal: a load-balanced adaptive routing algorithm
for torus networks, in Proceedings of International Symposium on Computer Architecture, San
Diego, 2003, pp. 194–205
8. J.W. van den Brand, C. Ciordas, K. Goossens, T. Basten, Congestion-controlled best-effort
communication for Networks-on-Chip, in Design, Automation Test in Europe Conference
Exhibition, Nice, 2007, pp. 1–6
9. J. Duato, I. Johnson, J. Flich, F. Naven, P. Garcia, T. Nachiondo, A new scalable and cost-
effective congestion management strategy for lossless multistage interconnection networks, in
Proceedings of International Symposium on High-Performance Computer Architecture, San
Francisco, 2005, pp. 108–119
10. P. Gratz, B. Grot, S.W. Keckler, Regional congestion awareness for load balance in Networks-
on-Chip, in Proceedings of IEEE International Symposium on High Performance Computer
Architecture, Salt Lake City, 2008, pp. 203–214
11. D. Park, R. Das, C. Nicopoulos, J. Kim, N. Vijaykrishnan, R. Iyer, C.R. Das, Design of a
dynamic priority-based fast path architecture for on-chip interconnects, in Proceedings of IEEE
Symposium on High-Performance Interconnects, Stanford, 2007, pp. 15–20
12. W.J. Dally, Virtual-channel flow control. IEEE Trans. Parallel Distrib. Syst. 3(2), 194–205
(1992)
13. S.A. Felperin, L. Gravano, G.D. Pifarre, J.L.C. Sanz, Fully-adaptive routing: packet switching
performance and wormhole algorithms, in Proceedings of ACM/IEEE Conference on Super-
computing, Albuquerque, 1991, pp. 654–663
14. W.J. Dally, H. Aoki, Deadlock-free adaptive routing in multicomputer networks using virtual
channels. IEEE Trans. Parallel Distrib. Syst. 4(4), 466–475 (1993)
15. J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Trans.
Parallel Distrib. Syst. 4(12), 1320–1331 (1993)
16. Y.M. Boura, C.R. Das, Efficient fully adaptive wormhole routing in n-dimensional meshes, in
Proceedings of International Conference on Distributed Computing Systems, Poznan, 1994,
pp. 589–596
17. J.H. Upadhyay, V. Varavithya, P. Mohapatra, Efficient and balanced adaptive routing in two-
dimensional meshes, in Proceedings of IEEE Symposium on High-Performance Computer
Architecture, Raleigh, 1995, pp. 112–121
18. J. Flich, S. Rodrigo, J. Duato, An efficient implementation of distributed routing algorithms
for NoCs, in Proceedings of ACM/IEEE International Symposium on Networks-on-Chip,
Newcastle, 2008, pp. 87–96
19. L.-S. Peh, W.J. Dally, A delay model and speculative architecture for pipelined routers, in
Proceedings of International Symposium on High-Performance Computer Architecture, Nuevo
Leone, 2001, pp. 255–266
20. S. Ma, N.E. Jerger, Z. Wang, Whole packet forwarding: efficient design of fully adaptive
routing algorithms for Networks-on-Chip, in Proceedings of the 2012 IEEE 18th International
Symposium on High-Performance Computer Architecture, HPCA’12, New Orleans (IEEE
Computer Society, Washington, DC, 2012), pp. 1–12
21. C.J. Glass, L.M. Ni, The turn model for adaptive routing, in Proceedings of International
Symposium on Computer Architecture, Gold Coast, 1992, pp. 278–287
22. G.M. Chiu, The odd-even turn model for adaptive routing. IEEE Trans. Parallel Distrib. Syst.
11(7), 729–738 (2000)
23. N. Barrow-Williams, C. Fensch, S. Moore, A communication characterisation of splash-2
and parsec, in IEEE International Symposium on Workload Characterization, Austin, 2009,
pp. 86–97
24. L.M. Ni, P.K. McKinley, A survey of wormhole routing techniques in direct networks.
Computer 26(2), 62–76 (1993)
25. T.L. Rodeheffer, M.D. Schroeder, Automatic reconfiguration in autonet, in Proceedings of
ACM Symposium on Operating Systems Principles, Pacific Grove, 1991, pp. 183–197
26. O. Lysne, J.M. Montanana, J. Flich, J. Duato, T.M. Pinkston, T. Skeie, An efficient and
deadlock-free network reconfiguration protocol. IEEE Trans. Comput. 57(6), 762–779 (2008)
27. O. Lysne, J. Duato, Fast dynamic reconfiguration in irregular networks, in Proceedings of
International Conference on Parallel Processing, Toronto, 2000, pp. 449–458
28. R. Casado, A. Bermudez, J. Duato, F.J. Quiles, J.L. Sanchez, A protocol for deadlock-free
dynamic reconfiguration in high-speed local area networks. IEEE Trans. Parallel Distrib. Syst.
12(2), 115–132 (2001)
29. D. Avresky, N. Natchev, Dynamic reconfiguration in computer clusters with irregular topolo-
gies in the presence of multiple node and link failures. IEEE Trans. Comput. 54(5), 603–615
(2005)
30. R. Casado, A. Bermudez, F.J. Quiles, J.L. Sanchez, J. Duato, Performance evaluation of
dynamic reconfiguration in high-speed local area networks, in Proceedings of International
Symposium on High-Performance Computer Architecture, Toulouse, 2000, pp. 85–96
31. J. Wu, A fault-tolerant and deadlock-free routing protocol in 2D meshes based on odd-even
turn model. IEEE Trans. Comput. 52(9), 1154–1169 (2003)
32. Z. Zhang, A. Greiner, S. Taktak, A reconfigurable routing algorithm for a fault-tolerant
2D-mesh Network-on-Chip, in Proceedings of ACM/IEEE Design Automation Conference,
Anaheim, 2008, pp. 441–446
33. D. Fick, A. DeOrio, G. Chen, V. Bertacco, D. Sylvester, D. Blaauw, A highly resilient routing
algorithm for fault-tolerant NoCs, in Proceedings of Design, Automation Test in Europe
Conference Exhibition, Nice, 2009, pp. 21–26
34. B. Fu, Y. Han, H. Li, X. Li, A new multiple-round dimension-order routing for Networks-on-
Chip. IEICE Trans. Inf. Syst. E94-D, 809–821 (2011)
35. B. Fu, Y. Han, H. Li, X. Li, Zonedefense: a fault-tolerant routing for 2-d meshes without virtual
channels. IEEE Trans. Very Large Scale Integr. Syst. (2013)
36. L. Zhang, Y. Han, Q. Xu, X. Li, Defect tolerance in homogeneous manycore processors
using core-level redundancy with unified topology, in Design, Automation and Test in Europe,
DATE’08, Munich, 2008, pp. 891–896
37. L. Zhang, Y. Han, Q. Xu, X. Li, H. Li, On topology reconfiguration for defect-tolerant noc-
based homogeneous manycore systems. IEEE Trans. Very Large Scale Integr. Syst. 17(9),
1173–1186 (2009)
38. W.J. Dally, C.L. Seitz, Deadlock-free message routing in multiprocessor interconnection
networks. IEEE Trans. Comput. C-36(5), 547–553 (1987)
39. A. Mejia, J. Flich, J. Duato, S.-A. Reinemo, T. Skeie, Segment-based routing: an efficient fault-
tolerant routing algorithm for meshes and tori, in Proceedings of International Symposium on
Parallel and Distributed Processing, Rhodes Island, 2006, p. 10
40. M. Palesi, R. Holsmark, S. Kumar, V. Catania, Application specific routing algorithms for
Networks-on-Chip. IEEE Trans. Parallel Distrib. Syst. 20(3), 316–330 (2009)
41. M.A. Kinsy, M.H. Cho, T. Wen, E. Suh, M. van Dijk, S. Devadas, Application-aware deadlock-
free oblivious routing, in Proceedings of International Symposium on Computer Architecture,
Austin (ACM, New York, 2009), pp. 208–219
42. J. Cong, C. Liu, G. Reinman, Aces: application-specific cycle elimination and splitting for
deadlock-free routing on irregular Network-on-Chip, in Proceedings of ACM/IEEE Design
Automation Conference, Anaheim, 2010, pp. 443–448
43. B. Fu, Y. Han, J. Ma, H. Li, X. Li, An abacus turn model for time/space-efficient reconfigurable
routing, in Proceedings of the 38th Annual International Symposium on Computer Architec-
ture, ISCA’11, San Jose (ACM, New York, 2011), pp. 259–270
44. Y. Han, Y. Hu, X. Li, H. Li, A. Chandra, Embedded test decompressor to reduce the required
channels and vector memory of tester for complex processor circuit. IEEE Trans. Very Large
Scale Integr. Syst. 15(5), 531–540 (2007)
45. A. Singh, W.J. Dally, A.K. Gupta, B. Towles, Adaptive channel queue routing on k-ary n-cubes,
in Proceedings of ACM Symposium on Parallelism in Algorithms and Architectures, Barcelona,
2004, pp. 11–19
46. D. Seo, A. Ali, W.-T. Lim, N. Rafique, M. Thottethodi, Near-optimal worst-case throughput
routing for two-dimensional mesh networks, in Proceedings of International Symposium on
Computer Architecture, Madison, 2005, pp. 432–443
47. M.M.K. Martin, D.J. Sorin, B.M. Beckmann, M.R. Marty, M. Xu, A.R. Alameldeen,
K.E. Moore, M.D. Hill, D.A. Wood, Multifacet’s general execution-driven multiprocessor
simulator (GEMS) toolset. ACM SIGARCH Comput. Archit. News 33(4), 92–99 (2005)
48. S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, A. Gupta, The splash-2 programs: characterization
and methodological considerations, in Proceedings of International Symposium on Computer
Architecture, Santa Margherita Ligure, 1995, pp. 24–36
49. C. Bienia, S. Kumar, J.P. Singh, K. Li, The parsec benchmark suite: characterization and
architectural implications, in Proceedings of International Conference on Parallel Architec-
tures and Compilation Techniques, PACT’08, Toronto (ACM, New York, 2008), pp. 72–81
50. E.S. Shin, V.J. Mooney III, G.F. Riley, Round-robin arbiter design and generation, in
Proceedings of International Symposium on System Synthesis, Kyoto, 2002, pp. 243–248
51. A.B. Kahng, B. Lin, K. Samadi, R.S. Ramanujam, Trace-driven optimization of Networks-
on-Chip configurations, in Proceedings of ACM/IEEE Design Automation Conference (DAC),
Anaheim, 2010, pp. 437–442
52. S. Rodrigo, J. Flich, A. Roca, S. Medardoni, D. Bertozzi, J. Camacho, F. Silla, J. Duato,
Addressing manufacturing challenges with cost-efficient fault tolerant routing, in Proceedings
of International Symposium on Networks-on-Chip, Grenoble, 2010, pp. 25–32
Chapter 5
Learning-Based Routing Algorithms
for On-Chip Networks
5.1 Introduction
In this section, we first investigate a minimal and adaptive routing algorithm, and then the learning-based approach is built upon this minimal routing. We call the obtained method LM (Learning method applied to Minimal routing).
A basic requirement for distributing messages over the network is to have some degree of adaptiveness when routing messages. We utilize a minimal and fully adaptive routing algorithm named Dynamic XY (DyXY) [8]. In this algorithm, which is based on the static XY algorithm, a message can be sent either along the X or the Y dimension. DyXY uses one and two virtual channels along the X and Y dimensions, respectively; this is the minimum number of virtual channels that can be employed to provide full adaptiveness.
The network can be proved to be deadlock-free as follows. The network is partitioned into two sub-networks, one covering +X and the other covering −X. In this way, the sub-networks are disjoint along the X dimension. Moreover, the first and second sub-networks use the first and second virtual channel along the Y dimension, respectively, and thus the sub-networks are disjoint along the Y dimension as well. Each switch in DyXY has seven pairs of channels, i.e., East (E), West (W), North-vc1 (N1), North-vc2 (N2), South-vc1 (S1), South-vc2 (S2), and Local (L).
Let us explain the learning method by assuming that the source s sends a message to the destination d through one of its neighboring switches x (see Fig. 5.1). The time it takes for a message to travel from the source s to the destination d is bounded by the sum of three quantities [6]: (1) Bx, the waiting time of the message in the input buffer of the switch x; (2) δ, the transmission delay over the link from the switch s to the switch x; and (3) Qx(n,d), the time it would take to send the message from the switch x to the destination switch d via the least congested neighboring switch (e.g., the switch n). This last value is extracted from the Q-Table of the switch x.
By sending the message over the link from the switch s to the switch x, the transmission delay δ is obtained; however, this value is considered negligible in this work. To measure Bx, instead of the waiting time in the input buffer, we use the number of occupied slots in the input buffer of the switch. Finally, Qx(n,d) can be extracted from the Q-Table of the switch x as soon as the output channel of the message is determined. These quantities are summed together and form a new estimated latency from the switch s to the destination d, obtained at the switch x.
Fig. 5.1 An example of the learning method: at each hop the local estimate (e.g., B_X = 1, B_Y = 3, B_Z = 2) and the global estimate (e.g., min Q_X(D,Y) = 4, min Q_Y(D,Z) = 5, min Q_Z(D,D) = 0) are summed and sent back to the upstream switch to update its Q-Table (panels a–d)
This information is sent back to the upstream switch s and updates the old estimated latency; this is performed by averaging the old and new estimated latency values. Figure 5.1 shows an example where a message is sent from the source s to the destination d. As illustrated in Fig. 5.1a, the source switch s maintains a table containing the estimated latencies it takes for a packet to reach each possible destination from this switch (i.e., up to n × n destinations in an n × n network). Each row of this table belongs to one destination switch in the network, and each column corresponds to one of the output channels of the switch.
Based on the minimal routing algorithm, DyXY, the message can be sent either through the output channel E or N2 from the switch s toward the destination d (Fig. 5.1a). According to the Q-Table values, sending the message through the output channel N2 leads to the lowest latency, so the message is delivered through this channel. When the message arrives at the switch x (Fig. 5.1b), it has to wait in the input buffer before being granted access to one of the output channels. This waiting time is modeled by the number of occupied slots in the input buffer of the switch x (the local estimated latency, Bx). At the switch x, there are two output channels, E and N2, that can deliver the message toward the destination. Based on the Q-Table of the switch x, the output channel E has a smaller estimated latency to the destination switch than the output channel N2 (the global estimated latency, Qx), and thus the message is sent through the output channel E. At this point, by summing up the local (Bx) and global (Qx(y,d)) estimated latencies, a new value is obtained which shows the estimated latency from the switch s to the destination d. Using congestion wires, this information is sent back to the switch s to update the corresponding entry (row: d; column: N2) of its table. The data message continues its path toward the destination by passing through the switch y. Similarly, after determining the output channel, the local and global information is sent back to the switch x through the congestion wires. This new information updates the corresponding entry of the switch x (row: d; column: E). A similar procedure is applied when the message arrives at the switch z. Finally, the message reaches the destination, and the congestion information is sent back to the previous switch. As can be seen from this example, delivering a single message from a source to a destination updates only one entry in each table; however, the Q-Tables are gradually updated as different messages propagate through the network over time.
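A sketch of the averaging update performed at a switch when the feedback for destination d arrives over the congestion wires (the data structures and names are illustrative, and δ is kept as an explicit, normally zero, term):

```python
def update_q_entry(q_table, dest, out_ch, b_next, q_next, delta=0):
    """Average the old latency estimate with the new one reported by
    the downstream switch: new = B_x + delta + min Q_x(n, d)."""
    new_estimate = b_next + delta + q_next
    q_table[dest][out_ch] = (q_table[dest][out_ch] + new_estimate) / 2
    return q_table[dest][out_ch]

# Switch s sent a message toward d through N2; the downstream switch
# reports one occupied buffer slot (B = 1) and a best Q-value of 4.
q_table = {"d": {"E": 3.0, "N2": 4.0}}
update_q_entry(q_table, "d", "N2", b_next=1, q_next=4)
assert q_table["d"]["N2"] == 4.5
```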
As shown in Fig. 5.2a, the size of a Q-Table is typically n × m × k, where n is the number of switches in the network, m is the number of output channels per switch, and k is the size of each entry in the Q-Table. Since a message can be delivered through at most two directions, the size of the Q-Table can be decreased to n × 2 × k. We call this table the Q-Routing table. As shown in Fig. 5.2b, the number of columns can be reduced to two: one is allocated to the X dimension and the other to the Y dimension. Obviously, with a larger number of virtual channels, more columns are needed.
Minimal routing algorithms can deliver messages through at most two minimal directions, and they cannot reroute messages around congested areas. They suffer from a low degree of adaptiveness and are inefficient in distributing traffic over the network, even if they have accurate knowledge of the network condition. In this section, we propose a non-minimal routing algorithm, named HARA, with a high degree of adaptiveness, which provides more output options at each switch [9], and we then apply a learning method on top of it [10]. We call the obtained method LNM (Learning method applied to Non-Minimal routing).
Fig. 5.3 (a) A switch in a double-Y network; (b) 0-degree-vc; (c) 0-degree-ch; (d) 90-degree;
(e) 180-degree-vc; (f) 180-degree-ch
Fig. 5.4 (a) 90-degree turns in vc1; (b) 90-degree turns in vc2; (c) 0-degree-ch; (d) 0-degree-vc
In order to avoid deadlock, the Mad-y method [11] prohibits some turns in the
double-Y network. For example, as shown in Fig. 5.4d, 0-degree-vc turns from
vc2 to vc1 may cause deadlock in the network and they are prohibited. The other
0-degree turns such as the 0-degree-ch turns (Fig. 5.4c) and the 0-degree-vc turns
from vc1 to vc2 (Fig. 5.4d) are permitted. As illustrated in Fig. 5.4a, b, out of sixteen
90-degree turns that can be potentially taken in a network, four of them cannot be
used in Mad-y. Finally, 180-degree turns are not allowed in Mad-y. To prove the
freedom of deadlock, a two-digit number (a,b) is assigned to each output channel
of a switch in an n × m mesh network. According to the numbering mechanism,
a turn connecting the input channel (Ia ,Ib ) to the output channel (Oa ,Ob ) is called
an ascending turn when (Oa > Ia ) or ((Oa = Ia ) and (Ob > Ib )). Figure 5.5 shows
Fig. 5.6 (a)–(j) 180-degree turns: some are forbidden in the whole network, while others are permitted only where Y < (n−2)/2 or Y > n/2
the numbers assigned to each channel for a switch at position (X,Y). Since this numbering mechanism forces messages to take the permitted turns in a strictly increasing order, Mad-y is deadlock-free.
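The ascending-turn test itself is a simple lexicographic comparison of the channel labels; a sketch:

```python
def is_ascending(in_label, out_label):
    """A turn from input channel (Ia, Ib) to output channel (Oa, Ob)
    is ascending iff Oa > Ia, or Oa = Ia and Ob > Ib."""
    (ia, ib), (oa, ob) = in_label, out_label
    return oa > ia or (oa == ia and ob > ib)

# Deadlock freedom then reduces to checking that every permitted turn
# at every switch connects channels in ascending label order.
assert is_ascending((3, 1), (3, 2))
assert not is_ascending((4, 0), (3, 7))
```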
As Mad-y is a minimal and adaptive routing method, it cannot fully utilize the
eligible turns to route messages through non-minimal but less congested paths. The
aim of the proposed non-minimal routing algorithm, HARA, is to enhance the
capability of the existing virtual channels in Mad-y to reroute messages around
congested areas and hotspots. Since the Mad-y and HARA methods combine
two virtual channels with different prohibited turns, they diminish the drawbacks
of turn models prohibiting certain turns at all locations. In minimal routings,
(e.g. Mad-y), 180-degree turns are prohibited but they can be incorporated in non-
minimal routings (e.g. HARA). One way to incorporate 180-degree turns is to
examine them one by one to see whether each turn leads to any cycle in the network.
After determining all allowable turns and in order to prove deadlock-freeness, the
numbering mechanism is utilized.
In HARA, however, we use the numbering mechanism of the Mad-y method to identify all 180-degree turns that can be taken in the ascending order, and then modify the mechanism to meet our requirements. According to this numbering mechanism, shown in Fig. 5.5, among the 180-degree-vc turns, those from vc1 to vc2 are taken in the ascending order (Fig. 5.6a, b), so it is safe to employ them in the network. As all 180-degree-vc turns from vc2 to vc1 take place in the descending order, they cannot be used in the network (Fig. 5.6c, d). Now, let us examine the 180-degree-ch turn connecting the first virtual channel of the north output port to the same virtual channel of the north input port (Fig. 5.6e). As shown in Fig. 5.5, the label on the north output channel with vc1 is (m−1−x, 1+y), and the label on the input channel of the north direction along the same virtual channel is (m−1−x, n−1−y). The turn takes place in the ascending order if and only if n−1−y is greater than 1+y. Therefore, this turn can be safely added to the set of allowable turns if the Y coordinate of a switch is less than (n−2)/2. Similarly, in Fig. 5.6f, the 180-degree-ch turn on vc2 of the north direction is permitted if the Y coordinate of a switch is less than (n−2)/2. The 180-degree-ch turns on the south direction (either on vc1 or vc2) are permitted if and only if the Y coordinate of a switch is greater than n/2 (Fig. 5.6g, h). Finally, the 180-degree-ch turn on the west direction is always permitted (Fig. 5.6i), while the 180-degree-ch turn on the east direction is prohibited in the network (Fig. 5.6j).
Fig. 5.7 The numbering mechanism of HARA along with all eligible turns in the network
As shown in Fig. 5.6e–h, there are four conditional 180-degree turns; two of them are allowable only in the northern part of the network and the two others only in the southern part. This not only increases the complexity of the routing function but also imposes a heterogeneous routing function on the switches. To overcome this issue, we modify the numbering mechanism such that two of these turns are permitted at all locations of the network (Fig. 5.6g, h) and the two others are prohibited in the whole network (Fig. 5.6e, f). The numbering mechanism of HARA, along with all permitted turns in the network, is shown in Fig. 5.7. As can be observed from this figure, all allowable turns are taken in the ascending order.
Theorem 1 HARA is deadlock-free
Proof If the numbering mechanism guarantees that all eligible turns are ordered in
the ascending order, no cyclic dependency can occur between channels. As can be
observed from Fig. 5.7, all connections between input channels and output channels
to form the eligible turns in HARA take place in the ascending order and thus HARA
is deadlock-free.
Theorem 2 HARA is livelock-free
Proof In HARA, when a message moves to the east direction, it can never be routed
back to the west direction. Therefore, in the worst case, the message may reach to
the leftmost column and then moves to the east direction toward the destination
column without the possibility of routing to the west direction again. Therefore,
after a limited number of hops, the message reaches the destination, and Theorem 2
is proved.
In the non-minimal routing, only eligible turns can be employed at each switch, but this alone is not sufficient to avoid blocking in the network. In fact, no cycles can be created, but messages might still be blocked forever. The reason is that, by utilizing the allowable turns, a message may reach a next hop from which no path to the destination exists, and it is then blocked. On the other hand, one of the aims of HARA is to fully utilize all eligible turns to present a low-restrictive adaptive method in the double-Y network. To achieve the maximal adaptiveness
without the blocking issue, for each combination of input channel and destination position, we examined all eligible 0-degree, 90-degree, and 180-degree turns separately. The output channels are selected in such a way that not only is the turn allowable, but it is also guaranteed that there is at least one path from the next switch to the destination switch. When a message arrives through one of the input channels, the routing unit determines one or several potential output channels for delivering the message. The routing decision is based on the relative position of the current and the destination switch, which falls within one of the following eight cases: north (N), south (S), east (E), west (W), northeast (NE), northwest (NW), southeast (SE), and southwest (SW). All permissible output channels of HARA, for each pair of input channel (inCh) and destination position (pos), are shown in Table 5.1. The adaptivity provided by Mad-y is illustrated in Table 5.2. By comparing these two tables, it can easily be seen that HARA offers a large degree of adaptiveness for routing messages. One of the drawbacks of non-minimal methods is their complexity, due to the different conditions considered in the routing decisions. However, as shown in Fig. 5.8 (extracted from Table 5.1), the implementation of HARA is very simple.
Figure 5.9 shows an example of the HARA method in a 5 × 5 mesh network in
which the source switch 7 sends a message to the destination switch 14. According
to Table 5.1, a message arriving from the local channel and going toward a
destination in the northeast position has six alternative choices (i.e. N1, N2, S1, S2,
E, and W); among them, the output channels N1, N2, and E lead to minimal
paths while S1, S2, and W lead to non-minimal paths. Since the neighboring
switches on the shortest paths lie in the congested area, the message is sent to a
neighboring switch located on a less congested non-minimal path. Again, at
switch 2, all the minimal paths are congested, so the message is sent to switch
1, which is less congested. The same strategy is repeated until the message reaches
the destination switch. This example shows the capability of the HARA method to
reroute messages around congested areas.
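To make the selection mechanism concrete, the following minimal Python sketch (our illustration, not the authors' implementation) looks up the permissible output channels of Table 5.1 for the (inCh, pos) pair used in this example and picks the least congested one; the table excerpt and the congestion values are assumed for illustration.

```python
# Minimal sketch of a table-driven HARA routing step (illustrative only).
# Only the (L, NE) entry of Table 5.1 is reproduced here.
HARA_TABLE = {
    ("L", "NE"): ["N1", "N2", "S1", "S2", "E", "W"],
}

def route(in_ch, pos, congestion):
    """Return the least congested permissible output channel."""
    candidates = HARA_TABLE[(in_ch, pos)]
    return min(candidates, key=lambda ch: congestion[ch])

# Assumed congestion estimates: the minimal channels (N1, N2, E) are
# congested, so a non-minimal channel is chosen, as in the example above.
congestion = {"N1": 7, "N2": 6, "E": 8, "S1": 2, "S2": 3, "W": 4}
print(route("L", "NE", congestion))  # -> "S1" (non-minimal, less congested)
```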
We utilize an optimized learning model for the selection function of the proposed
non-minimal approach. This method is called LNM (Learning method applied to
Non-Minimal routing). As already mentioned, the size of a Q-Table is n × m × k,
where n is the number of switches in the network, m is the number of output channels
per switch, and k is the size of each entry in the Q-Table. This size can be decreased
to n × 2 × k in the Q-Routing table. However, the required area of Q-Routing is still
very large and grows with the network size.
In non-minimal approaches, the number of columns cannot be reduced to two,
as a message might be sent through any of the output channels.
Table 5.1 Potential output channels offered by HARA, for each input channel (inCh) and destination position (pos)

| inCh | N           | S           | E                | W        | NE               | NW       | SE               | SW       |
|------|-------------|-------------|------------------|----------|------------------|----------|------------------|----------|
| L    | N1,N2,S1,W  | N1,S1,S2,W  | N1,N2,S1,S2,E,W  | N1,S1,W  | N1,N2,S1,S2,E,W  | N1,S1,W  | N1,N2,S1,S2,E,W  | N1,S1,W  |
| N1   | N2,S1,W     | S1,S2,W     | N2,S1,S2,E,W     | S1,W     | N2,S1,S2,E,W     | S1,W     | N2,S1,S2,E,W     | S1,W     |
| N2   | –           | S2          | S2,E             | –        | S2,E             | –        | S2,E             | –        |
| S1   | N1,N2,S1,W  | N1,S1,S2,W  | N1,N2,S1,S2,E,W  | N1,S1,W  | N1,N2,S1,S2,E,W  | N1,S1,W  | N1,N2,S1,S2,E,W  | N1,S1,W  |
| S2   | N2          | –           | N2,E             | –        | N2,E             | –        | N2,E             | –        |
| E    | N1,N2,S1,W  | N1,S1,S2,W  | N1,N2,S1,S2,E,W  | N1,S1,W  | N1,N2,S1,S2,E,W  | N1,S1,W  | N1,N2,S1,S2,E,W  | N1,S1,W  |
[Fig. 5.9 (mesh detail): switches of the 5 × 5 network numbered row by row from 0 at the bottom-left, e.g. 0–4 on the bottom row and 15–19 on the fourth row]
To address the size and scalability problem of Q-Tables, we proposed a new table
for the LNM approach called the Region-based Routing table (R-Routing). As
illustrated in Fig. 5.10, each row of this table corresponds to one of the eight
different positions of the destination switch (i.e. N, S, E, W, NE, NW, SE, and SW)
and each column indicates an output channel (i.e. N1, N2, S1, S2, E, and W).
Regardless of the network size, the size of R-Routing tables is 8 × 6 × k, which is
considerably smaller than Q-Routing and Q-Tables. There is another type of table,
called C-Routing, which decreases the size of Q-Tables by taking advantage of a
clustering approach [12]. The size of C-Routing tables is (l + c) × m × k, consisting
of two parts: (1) the cluster part with a size of c × m × k, where c is the number of
clusters, and (2) the local part with a size of l × m × k, where l is the number of
switches in each cluster. The clustering approach suffers from a scalability issue
since the size of C-Routing tables can become rather large as the network scales up.
There are some other concerns regarding the clustering model, such as determining
the cluster size for different network sizes or partitioning the network when the
network size is not a multiple of the cluster size.
The required sizes for Q-Routing, C-Routing, and R-Routing tables are reported
in Table 5.3, where k = 4 (the size of each entry in a table) and l = 4 (the number
of switches within each cluster for the clustering approach). Note that the reported
areas for Q-Routing and C-Routing tables are based on the assumption that no
virtual channels are used, while the size of the R-Routing table is measured
considering one virtual channel along the Y dimension. As can be seen in this table,
not only are the sizes of R-Routing tables very small, but they are also independent
of the network size.
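The scaling behavior of the three tables can be checked with a short sketch. The following Python fragment (ours, not from the chapter) evaluates the size formulas given above with k = 4 and l = 4 as in Table 5.3, and with m = 6 output channels for the double-Y network.

```python
# Table sizes in bits, following the formulas in the text.
def q_routing_bits(n, k=4):
    return n * 2 * k              # Q-Routing: n x 2 x k

def c_routing_bits(n, m=6, k=4, l=4):
    c = n // l                    # c clusters of l switches each [12]
    return (l + c) * m * k        # C-Routing: (l + c) x m x k

def r_routing_bits(k=4):
    return 8 * 6 * k              # R-Routing: 8 regions x 6 channels x k

for side in (8, 14):
    n = side * side
    print(f"{side}x{side} mesh: Q={q_routing_bits(n)} b, "
          f"C={c_routing_bits(n)} b, R={r_routing_bits()} b")
# Only the R-Routing size stays constant as the mesh grows.
```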
While the sizes of R-Routing tables are small enough to be applicable in
NoCs, one might object that by using R-Routing tables, the accuracy of the
estimated latency values toward each destination is reduced. In fact, under real traffic
conditions, each entry of an R-Routing table is inherently influenced by the switches
that communicate most with the current switch during that period. Therefore,
it is not necessary to allocate a row for every switch in the network. Moreover, R-
Routing tables are updated more frequently than Q-Routing and C-Routing tables,
since messages destined for the same region can be used to update R-Routing
tables, while in the two other models each entry is updated only by messages for the
same destination.
Now, let us explain the LNM approach using the example of Fig. 5.11, where a
message is generated at the source switch s for the destination switch d. According
to HARA, when a message arrives from the local input channel and the destination is
in the northeast position, six output channels can be selected to forward the message
(i.e. N1, N2, S1, S2, E, and W). In Fig. 5.11a, the colored entry of the Q-Table
indicates the estimated latencies of a message from each possible output channel to
the northeast region. Since the output channel N1 has the lowest estimated latency,
the message is delivered through this output channel toward the destination switch.
At the switch x, the message is received by the input channel S1 (Fig. 5.11b).
According to the information in Table 5.1, multiple output channels can
be used to forward the message (i.e. N1, N2, S1, S2, E, and W). Among the eligible
output channels, the output channel E has the lowest latency, and thus it is selected
for sending the message to the switch y. At this time, the local and global congestion
values should be returned to the switch s. The number of occupied buffer slots at the
input buffer of the switch x is counted as the local information (i.e. in this example
B_x = 1). The minimum estimated latency of routing messages from the switch x
to the destination region via the neighboring switch y is considered as the global
latency, and it is extracted from the Q-Table of the switch x (i.e. in this example
min Q_x(y,d) = 4). By summing up the local and global information, a new estimation
is obtained which shows the latency from the switch s to the destination d. Finally,
the corresponding entry of the Q-Table at the switch s (i.e. row: NE; column: N1)
should be updated with the new value. This is done by taking the average of the old
and new latency estimations (Fig. 5.11a).
At the switch y, the message is received via the west input channel (Fig. 5.11c).
Among the three possible output channels (i.e. N2, S2, and E), the one with the
lowest latency is selected. Upon connecting the input channel to the output channel
of the switch y, local and global information are returned to the switch x. The local
congestion shows the number of occupied slots at the input buffer of the switch y (i.e.
B_y = 3), while the global congestion indicates the estimated latency from the switch
y to the destination switch d via the neighboring switch z (i.e. min Q_y(z,d) = 5).
The sum of the local and global values is a new latency estimation to reach the
destination from the switch x. As shown in Fig. 5.11b, the corresponding entry of
the Q-Table at the switch x is updated by taking an average of the new estimated value
(i.e. B_y + min Q_y(z,d)) and the existing estimation (Q_x(y,d)).
Finally, the message arrives at the switch z from the input channel S2 (Fig. 5.11d).
This message can reach the destination by being sent through the output channel N2
or E. The output channel E has the lowest latency value, and it is selected for
routing the message. The local latency (i.e. the number of occupied slots at the
input buffer of the switch z) is 3, while the global latency to the destination is
equal to 0, as the message reaches the destination in the next hop. Similarly, the
latency values are returned to the switch y and the Q-Table is updated with this
information (Fig. 5.11c). Hence, as messages are propagated inside the network,
Q-Tables gradually incorporate more global information [13].
Q-Routing models have an initial learning period during which they perform worse
than minimal schemes. The reason for this temporary inefficiency is that there is a
possibility of choosing non-minimal paths even when the network is not congested.
To cope with this problem, in the initialization phase, all entries of Q-Tables are
initialized such that minimal output channels are set to "0000" while non-minimal
output channels are set to "1000" and can never drop below this value. Accordingly,
under low traffic conditions only the shortest paths are selected, while non-minimal
paths are used to distribute traffic when the network becomes congested.
In order to transfer the congestion information, LNM utilizes a 4-bit wire
between each pair of neighboring switches. The local congestion information is a 2-
bit value indicating the congestion level of an input buffer. The global congestion
information is a 4-bit value which provides a global view of the latency from
the output channel of the current switch to the destination region. This global
information is extracted from the corresponding entry of the R-Routing table. The
Q-Values are updated whenever a message is propagated between two neighboring
switches. Suppose that a message is sent from the switch x to the destination switch
d by passing through the neighboring switch y and then the switch z with the lowest
estimated latencies. At the switch y, upon connecting the input channel to the output
channel, the 2-bit local and 4-bit global values are aggregated into a 4-bit value
(saturating at the maximum value "1111") and then transferred to the switch x. This
value is a new estimation of the latency from the selected output channel of the switch
x to the destination d. The corresponding entry of the Q-Table at the switch x is
updated by taking an average of the new estimated value (i.e. B_y + min Q_y(z,d)) and
the existing estimation (Q_x(y,d)). Commonly, in Q-Routing models, an update of the
following form is used:

Q_x(y,d) ← (1 − α) · Q_x(y,d) + α · (B_y + min_z Q_y(z,d)),

where α is the learning rate; the simple averaging applied here corresponds to α = 0.5.
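A behavioral sketch of this update could look as follows. This is our illustration, not the authors' hardware: it assumes the 4-bit aggregation saturates at "1111", that averaging realizes α = 0.5, and that non-minimal entries are clamped at "1000" as described above.

```python
SAT_MAX = 0b1111      # 4-bit saturating aggregation
NM_FLOOR = 0b1000     # initialization floor for non-minimal channels

def update_q(q_old, b_local, q_global_min, non_minimal=False):
    """One LNM Q-value update at switch x for neighbor y (sketch)."""
    # new estimate: B_y + min_z Q_y(z, d), saturated to 4 bits
    estimate = min(b_local + q_global_min, SAT_MAX)
    q_new = (q_old + estimate) // 2          # average, i.e. alpha = 0.5
    return max(q_new, NM_FLOOR) if non_minimal else q_new

# Example from the walkthrough: B_x = 1 occupied slot, min Q_x(y,d) = 4
print(update_q(q_old=6, b_local=1, q_global_min=4))  # -> 5
```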
The efficiency of the LM and LNM methods is compared with the DBAR [15]
approach. DBAR is an adaptive routing algorithm that uses local and non-local
congestion information. The performance of the DBAR approach is extensively
discussed in [15], where it is compared with the XY, NoP [16] and RCA [17]
methods. In this work, the adaptivity of DBAR is similar to that of the DyXY
routing algorithm. A wormhole-based NoC simulator was developed in VHDL to
model all major components of the on-chip network, and simulations are carried
out to determine the latency characteristics of each network.
[Fig. 5.12: average latency (cycles) versus injection rate (flits/node/cycle) for LNM, LM, and DBAR]
Fig. 5.12 Performance evaluation in (a) 8 × 8 and (b) 14 × 14 mesh network under the uniform
traffic model
The message length is uniformly distributed between 5 and 10 flits. The data
width is set to 32 bits and each input channel has a buffer (FIFO) size of 8 flits.
The simulator is warmed up for 12,000 cycles and then the average performance
is measured over another 200,000 cycles. Two synthetic traffic profiles, uniform
random and hotspot, along with SPLASH-2 [18] application benchmarks, are used.
In Fig. 5.12, the average communication delay as a function of the average message
injection rate is plotted for 8 × 8 and 14 × 14 mesh networks. As observed from the
results, at low loads the Q-Routing schemes (LNM and LM) behave as efficiently
as DBAR. As the load increases, DBAR is unable to tolerate the high-load condition,
while the Q-Routing schemes learn an efficient routing policy. LNM leads to the
lowest latency due to the fact that it can distribute traffic over both minimal and non-
minimal paths. In fact, in DBAR and LM, messages use minimal paths, so they
are routed through the very center of the network, which creates permanent hotspots.
Correspondingly, messages traversing the center of the network are delayed much
more than they would be if they could use non-minimal paths. Because the
LNM method can reroute messages, it relieves the congested areas and performs
considerably better than the other schemes. Using minimal and non-minimal routes
along with the intelligent selection policy reduces the average network latency
of LNM in an 8 × 8 network (near the saturation point, 0.3) by about 34 % and 45 %
compared with LM and DBAR, respectively.
[Fig. 5.13: average latency (cycles) versus injection rate for LNM, LM, and DBAR]
Fig. 5.13 Performance evaluation in (a) 8 × 8 and (b) 14 × 14 mesh network under hotspot traffic
model with H = 10 %
Application traces are obtained from the GEMS simulator [19] using application
benchmarks selected from SPLASH-2 [18]. We use a 64-switch network
configuration, including 20 processors and 44 L2-cache memory modules. For the
CPU, we assume a core similar to Sun Niagara using the SPARC ISA [20]. Each
L2 cache module is 512 KB, and thus the total shared L2 cache is 22 MB. The
memory hierarchy is governed by a two-level directory cache coherence protocol.
Each processor has a private write-back L1 cache (split L1 I and D cache, 64 KB,
2-way, 3-cycle access). The L2 cache is shared among all processors and split into
banks (44 banks, 512 KB each for a total of 22 MB, 6-cycle bank access), connected
via on-chip switches. The L1/L2 block size is 64 B. Our coherence model is based on
a MESI protocol with distributed directories, with each L2 bank maintaining
its own local directory. The simulated memory hierarchy mimics SNUCA [21], while
the off-chip memory is a 4 GB DRAM with a 220-cycle access time. Figure 5.14
shows the average message latency across six benchmark traces, normalized to
DBAR. LNM provides lower latency than the other schemes and shows the greatest
performance gain in Ocean, with a 32 % reduction in latency (vs. LM). Across all
benchmarks, LNM achieves performance gains of up to 27 % vs. LM and 35 %
vs. DBAR.
[Fig. 5.14: latency normalized to DBAR for DBAR, LM, and LNM on Barnes, Cholesky, FFT, LU, Ocean, and Radix]
Fig. 5.14 Average message latency under different application benchmarks, normalized to DBAR
To assess the area overhead and power consumption of LNM, the whole platform of
each scheme is synthesized with Synopsys Design Compiler. Each scheme includes
switches, communication channels, and congestion wires. For synthesis, we use the
UMC 90 nm technology at an operating frequency of 1 GHz and a supply voltage of
1 V. We perform place-and-route using Cadence Encounter to obtain precise power
and area estimations. The power dissipation of each scheme is calculated under the
hotspot traffic profile near the saturation point (0.18) using Synopsys PrimePower
in an 8 × 8 mesh network. The layout area and power consumption of each platform
are shown in Table 5.4. Comparing the area cost of the platform using LNM with
the platforms using LM and DBAR indicates that the learning approaches consume
more power and incur a higher area overhead than DBAR. The LNM platform
consumes more average power because rerouting messages around congested areas
increases the hop count. The results indicate that the maximum power of LNM
is 7 % and 13 % less than that of the LM and DBAR platforms, respectively. This
is achieved by smoothly distributing the power consumption over the network using
the highly adaptive routing scheme, which reduces hotspots in NoCs. The maximum
power values reported in the table belong to the switch designated as the hotspot,
(4,4).
5.6 Conclusion
In a minimal adaptive routing method, each switch monitors the local congestion
condition and makes a routing decision based on this information. However, this will
not lead to optimal performance. The reason is that in minimal routing, packets
are limited to at most two directions at each intermediate switch and thus they
cannot be well distributed over the network. To solve this problem, we introduced
a highly adaptive routing algorithm which provides a large degree of adaptiveness.
To find the least congested route among all non-minimal routes, a learning-based
approach is utilized. For this purpose, a table is needed at each switch. These tables
are relatively small and designed in a scalable manner. The experimental results
confirm the advantages of combining non-minimal routing and learning-based
approaches over traditional methods.
References
14. C. Feng, Z. Lu, A. Jantsch, J. Li, M. Zhang, A reconfigurable fault-tolerant deflection routing
algorithm based on reinforcement learning for network-on-chip, in Proceedings of the Third
International Workshop on Network on Chip Architectures (2010), pp. 11–16
15. S. Ma, N. Enright Jerger, Z. Wang, DBAR: An efficient routing algorithm to support multiple
concurrent applications in networks-on-chip, in Proceedings of the 38th Annual International
Symposium on Computer Architecture (2011), pp. 413–424
16. G. Ascia, V. Catania, M. Palesi, D. Patti, Implementation and analysis of a new selection
strategy for adaptive routing in networks-on-chip. IEEE Trans. Comput. 57(6), 809–820 (2008)
17. P. Gratz, B. Grot, S.W. Keckler, Regional congestion awareness for load balance in networks-
on-chip, in Proceedings of IEEE 14th International Symposium on High Performance
Computer Architecture, HPCA (2008), pp. 203–214
18. S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, A. Gupta, The SPLASH-2 programs: Charac-
terization and methodological considerations, in Proceedings of 22nd Annual International
Symposium on Computer Architecture (1995), pp. 24–36
19. M.M.K. Martin, D.J. Sorin, B.M. Beckmann, M.R. Marty, M. Xu, A.R. Alameldeen, K.E.
Moore, M.D. Hill, D.A. Wood, Multifacet’s general execution-driven multiprocessor simulator
(GEMS) toolset. SIGARCH Comp. Archit. News 33(4), 92–99 (2005)
20. P. Kongetira, K. Aingaran, K. Olukotun, Niagara: A 32-way multithreaded Sparc processor.
IEEE. Micro. 25(2), 21–29 (2005)
21. B.M. Beckmann, D.A. Wood, Managing wire delay in large chip-multiprocessor caches, in
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
(2004), pp. 319–330
Part II
Multicast Communication
Chapter 6
Efficient and Deadlock-Free Tree-Based
Multicast Routing Methods
for Networks-on-Chip (NoC)
Abstract This chapter presents a new efficient and deadlock-free tree-based
multicast routing method and concept. The presented deadlock-free multicast
routing algorithm can be implemented on a network-on-chip (NoC) router microar-
chitecture realizing a mesh planar network topology. The NoC microarchitecture
supports both deadlock-free static and efficient adaptive tree-based multicast rout-
ing. Multicast packets are routed and scheduled in the NoC by using a flexible
multiplexing/interleaving technique with wormhole switching. The flexibility of the
proposed multicast routing method is based on a locally managed packet identity
(ID-tag) attached to every flit. This concept allows different packets to be interleaved
at flit level in a single buffer pool on the same link. Furthermore, a pheromone
tracking strategy is presented in this chapter, which is used to reduce communication
energy in the adaptive tree-based multicast routing method. The strategy is used to
construct efficient spanning trees for the adaptive tree-based multicast routing,
which are generated at runtime.
6.1 Introduction
This chapter describes the use of an efficient multicast routing method for NoCs.
However, the use of a tree-based multicast routing method may lead to a multicast
dependency problem ending up in a deadlock, as presented in Fig. 6.1. This
multicast dependency can cause a multicast deadlock configuration (as described
in Duato's book [10]). In this case, multicast packets block each other and cannot
move further.
The deadlock problem occurs especially if packets switched with the wormhole
method or virtual cut-through switching are not short enough, there is not enough
buffer space to store the contending wormhole packet, and/or arbitration rules are
not well organized to handle the multicast contention. In node (2,2), as presented in
the figure, packet A blocks the flow of two multicast branches of packet B (to west
and east), while in node (2,1), packet B blocks the flow of two multicast branches of
packet A (to west and to east). Due to the "wait and hold" situation in both network
switch nodes, both messages A and B cannot move further.
In this chapter a NoC architecture called XHiNoC (eXtendable Hierarchical
Network-on-Chip) is presented, which proposes a novel wormhole cut-through
switching concept [26] based on a local identity-based (ID-based) interleaved
routing organization, in which the ID-tag of each packet is locally updated on
each communication link [23]. The XHiNoC concept has also introduced a novel
multicast routing method [22, 24, 25] for NoCs based on the interleaved routing
organization with local identity (ID) management. In the XHiNoC routers, the
previously described deadlock configuration problem (due to multicast dependencies)
is solved efficiently by using a so-called "hold and release tagging mechanism".
[Fig. 6.1: multicast deadlock configuration: packets A and B block each other's multicast branches at nodes (2,1) and (2,2)]
Figure 6.2 illustrates how five packets or data streams (a, b, c, d and e) can be
interleaved on each communication link. For instance, consider the ID-tag allocation
of stream a, sent from Sw4 to Sw3 via Sw5 and Sw2. Its ID-tags are mapped
and allocated to ID slot 2 on the Local input link of Sw4, ID slot 1 on the West input
link of Sw5, ID slot 1 on the North1 input link of Sw2, ID slot 1 on the West input
link of Sw3 and, at last, to ID slot 0 on the Local output link of Sw3. Flits belonging
to the same packet or stream will have the same ID-tag on a specific link. Therefore,
the ID-tag attached to every flit enables each packet or streaming data flit to be
switched to the correct routing direction. In other words, the ID-tag represents the
compressed form of the routing direction established by the header flit when
reserving communication resources.
[Fig. 6.2: five streams (a–e) interleaved on the links between switches Sw1–Sw6, each with its per-link ID-tag assignment]
Each streaming data flit can extract the required routing direction from the routing
table that has been indexed in accordance with the local ID-tag of the packet stream.
The local ID-tag attached to each flit of the packet (see Fig. 6.5) is updated and
dynamically changed once a flit is switched to a new outgoing port. The ID update
is made by an ID management unit located at every output port. Figure 6.3 shows a
detailed view of the link sharing between streams a, b and c as presented in Fig. 6.2.
The communication resource (link) connecting the South1 output port of Switch 5
(Sw5) and the North1 input port of Switch 2 (Sw2) assigns packet streams a, b and c
the ID-tags 1, 0 and 2, respectively. The tables presented on the right side of the
switches are the ID-slot table of the South1 output port at Sw5 and the routing table
(LUT) of the North1 input port at Sw2. The content of the ID-slot table represents
the ID-tag mapping function of each packet stream. The content of the routing table
represents the routing directions to keep the correct routing tracks for each flit of the
interleaved packet streams.
Figure 6.4 shows how the ID-tag of a stream header coming from the NORTH port
with ID-tag 3 is updated after being switched. The ID update process works as
follows: when the IDM detects a new incoming stream or packet header, it searches
for a free ID slot on the output link by checking the ID-slot states in the
corresponding table. In the example case, the ID-tag 2 is free. This ID is then
assigned as the new ID-tag on the next link segment. The ID-slot 2 is indexed based
on the associated incoming local ID-tag 3 and the incoming direction (NORTH).
Hence, a data flit following the packet/stream at any instant of time coming from
the NORTH port with ID-tag 3 will also be assigned the new ID-tag 2 at the
outgoing SOUTH port. In the same phase, the ID-tag 2 flag is set from the "free" to
the "used" state, and the number of used IDs (UID) is incremented.
[Fig. 6.4 (excerpt): the number of used ID-tags (nid) is set from 2 to 3 when a header flit is assigned a free ID slot]
When the UID has reached the number N of available ID slots, the "empty free ID
flag" is set. When a tail flit (the end of the stream data) passes through the outgoing
port, the state of the related ID-tag (here ID-tag 2) is set from "used" back to "free",
the UID is decremented, and the information related to this ID number is
concurrently deleted from the ID-slot table (Figs. 6.3 and 6.4).
More information about the dynamic ID management can be found in detail in
[22–24, 26].
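Behaviorally, the ID management can be summarized by the following Python sketch (an illustration only; the real unit is a hardware block, and the slot count and port names are assumptions):

```python
N_SLOTS = 4  # ID slots per link (illustrative)

class IDManager:
    """Sketch of per-output-link dynamic ID management."""
    def __init__(self):
        self.free = set(range(N_SLOTS))   # free ID-tags on this link
        self.id_map = {}                  # (in_port, in_id) -> out_id

    def on_header(self, in_port, in_id):
        # assumes a free slot exists; real hardware stalls otherwise
        out_id = min(self.free)
        self.free.remove(out_id)          # slot goes "free" -> "used"
        self.id_map[(in_port, in_id)] = out_id
        return out_id

    def on_payload(self, in_port, in_id):
        return self.id_map[(in_port, in_id)]  # same mapping for all flits

    def on_tail(self, in_port, in_id):
        out_id = self.id_map.pop((in_port, in_id))
        self.free.add(out_id)             # slot returns to "free"
        return out_id

idm = IDManager()
print(idm.on_header("NORTH", 3))   # header with ID-tag 3 gets a new out-ID
print(idm.on_payload("NORTH", 3))  # payload flits inherit the mapping
```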
The packet format used in XHiNoC, supporting the ID-based routing organization,
is depicted in Fig. 6.5. A multicast packet consists of a number N_hf of header
flits (which is equal to the number of multicast destinations N_dest), any number of
payload flits and a tail flit. The flit flow-control field of every data flit consists of a
3-bit flit type header and a 4-bit packet ID (identity). Therefore, in a 32-bit instance
of the system, each flit of the packet has a width of 39 bits, i.e. a 32-bit data word plus
a 7-bit control field. The type can be header, data body, or end of data body (last/tail
flit), as shown in Fig. 6.5. Flits belonging to the same message will always have
the same local ID label on one and the same communication link. The ID number
attached to each flit will vary over different communication links to support a wire-
sharing concept with flit-level message interleaving. This concept scales well with
increasing mesh network sizes.
Definition 1. A multicast message (even if its size is very large) is not divided
into several sub-packets. Therefore, when N_pf payload flits are sent to N_dest
destination nodes, the size of the multicast message is N_hf + N_pf + 1 flits
(N_hf header flits, N_pf payload flits, and one tail flit).
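The 39-bit flit layout and the message-size rule of Definition 1 can be illustrated with a small packing sketch (ours; the bit positions of the type and ID fields inside the 7-bit control field are assumptions, since Fig. 6.5 is not reproduced here):

```python
# Illustrative flit encodings: 3-bit type, 4-bit ID-tag, 32-bit data word.
HEADER, BODY, TAIL = 0b001, 0b010, 0b100   # assumed type codes

def pack_flit(ftype, id_tag, data):
    assert 0 <= id_tag < 16 and 0 <= data < 2**32
    return (ftype << 36) | (id_tag << 32) | data   # 39 bits total

def unpack_flit(flit):
    return (flit >> 36) & 0b111, (flit >> 32) & 0b1111, flit & 0xFFFFFFFF

def message_size(n_dest, n_pf):
    # Definition 1: N_hf header flits (= N_dest), N_pf payload flits, 1 tail
    return n_dest + n_pf + 1

f = pack_flit(BODY, id_tag=2, data=0xCAFE)
print(unpack_flit(f))        # (2, 2, 51966)
print(message_size(8, 100))  # a multicast to 8 targets: 109 flits
```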
When, for instance, the West-First routing algorithm is used, the adaptivity of the
multicast tree in the West direction is limited by the prohibited South–West and
North–West turns.
[Fig. 6.6: a 5 × 5 mesh NoC divided into an X+ subnetwork and an X− subnetwork; nodes are addressed by their (x, y) coordinates]
Adaptive tree-based multicast routing therefore cannot be performed if the
destination addresses are located in the South-West or North-West quadrant area
[24]. These prohibited turns must be implemented in the routing algorithms to avoid
the occurrence of a deadlock configuration. In order to overcome such problems, a
planar 2D NoC architecture with mesh topology is also presented in this chapter.
The NoC is divided into two sub-networks in order to increase the degree of
adaptivity of the routing function. A planar adaptive routing algorithm was first
introduced in [6], in which virtual channels (VCs) are used to support adaptive
routing and to couple the sub-networks. The main difference of the presented
approach is that, instead of using VCs, we replace them with double physical
communication links to increase the link and switch bandwidth capacity.
Figure 6.6 shows an example of a 2-D 5 × 5 mesh network. The Network-
on-Chip is physically divided into two subnetworks, i.e., the X+ subnetwork
(depicted with solid-line arrows) and the X− subnetwork (depicted with dashed-line
arrows). If the x-distance between source and target nodes (x_offs = x_target −
x_source) is zero or positive, then packets will be routed using the X+ subnetwork.
If x_offs is zero or negative, then the packets will be routed through the physical
channels of the X− subnetwork. The ports connected with the vertical y-direction
links of the X+ and X− subnetworks are denoted by (North1, South1) and (North2,
South2), respectively. The packets being routed through the X+ subnetwork can
adaptively make West–North1, West–South1, North1–East and South1–East turns
as well as the West–East, North1–South1 and South1–North1 straightforward
(non-turn) routing directions. The packets being routed through the X− subnetwork
can adaptively make East–North2, East–South2, North2–West and South2–West
turns as well as the East–West, North2–South2 and South2–North2 straightforward
routing directions.
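A compact sketch of the subnetwork selection rule reads as follows (our illustration; since a zero x-offset satisfies both conditions in the text, the X+ subnetwork is chosen arbitrarily here):

```python
def select_subnet(x_source, x_target):
    """Choose the physical subnetwork for a packet (sketch)."""
    x_offs = x_target - x_source
    # x_offs >= 0 -> X+ subnetwork (North1/South1 vertical ports),
    # x_offs <  0 -> X- subnetwork (North2/South2 vertical ports)
    return "X+" if x_offs >= 0 else "X-"

print(select_subnet(1, 3))  # -> X+
print(select_subnet(3, 1))  # -> X-
```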
The planar adaptive routing technique on a mesh topology was first introduced
in [6] and is deadlock-free in principle. Instead of using virtual channels to
implement the interconnects between the NORTH and SOUTH ports as done in [6],
the presented architecture uses doubled physical vertical links.
In this chapter, we will evaluate different models of NoCs with mesh topology by
using four mesh prototypes with different routing algorithms. The first model (the
'plnr' prototype) uses a 2D multicast planar adaptive routing algorithm that is
presented in Algorithm 4 and is implemented based on the planar mesh topology.
The adaptive routing decision is made based on the number of used ID slots (or the
number of free ID slots) on each possible alternative routing direction and the
record of the routing paths made by other headers of the same message, which is
explained later in Sect. 6.5.3 and presented in Algorithm 2. The routing algorithm is
divided into two subalgorithms for the X+ and X− subnetworks.
The second prototype (the 'xy' prototype) uses a static XY routing algorithm in the
standard mesh topology, where messages are first routed in the horizontal X-direction
and then in the vertical Y-direction. Hence, the North–East, North–West, South–East
and South–West turns are prohibited in the static XY routing algorithm.
The remaining two models (the 'wf-v1' and 'wf-v2' prototypes) use minimal West-
First routing algorithms. In the West-First routing algorithm, packets will always be
routed first to the West direction if their destination nodes are located in the western
area relative to the source node or current position. When the destination nodes are
located in the eastern area of the source node or current position, the packet can be
routed adaptively to the East, North or South directions [10].
[Fig. 6.7: generic microarchitecture of the XHiNoC router: FIFO buffers and REB units at the input ports, Arbiter (A) and Multiplexer with ID-Management (MIM) units at the output ports, connected by crossbar interconnects]
The general microarchitecture of the XHiNoC router is presented
in Fig. 6.7. Each module is modeled based on generic code, which is strongly related
to the number of input-output connectivities of each port. The components located at
input ports are FIFO buffers and a Routing Engine with Data Buffering (REB). The
components located at output ports are an Arbiter (A) and a Crossbar Multiplexer
with ID-Management Unit (MIM). The working principles and mechanisms of the
Arbiter, REB and MIM units are explained in detail in [22] and [24].
Figure 6.8 shows the components of an incoming port of the XHiNoC router. In
the REB module, there are a Grant-Multicasting Controller (GMC), a Routing
Engine (RE), a Route Buffer and a Read-Logic Unit (RLU). The GMC consists of
combinatorial logic and is used to control the acceptance of the multicast routing
acknowledge (grant) signals from the output ports. The RLU is also combinatorial
logic, used to control the read operation of a flit from the FIFO buffer into
the Route Buffer. While a routing direction for a flit is being decided by the RE
unit, the flit is concurrently buffered in the Route Buffer. This concurrent
step is introduced in order to reduce the number of internal pipeline stages in the
XHiNoC router, and it improves the router performance accordingly. The RE
unit consists of the Routing State Machine (RSM) and the Routing Reservation Table
(RRT), which contains a number H of reservable routing slots.
The design of the XHiNoC routers can be fully parametrized and customized on
demand. Each VHDL entity contains generic code, which enables the derivation of
new VHDL modules with a specific architecture and a number of input/output pins
according to the specification. The custom-generic modular-based design approach
also enables us to easily generate irregular NoC topologies.
Before running a real experiment, the different routing paths produced by the
aforementioned tree-based multicast routing methods (except for the wf-v1 multicast
method) are shown in Fig. 6.9. A multicast message is injected from node (2,2) to
10 multicast destinations.
[Fig. 6.8: input-port datapath: a FIFO buffer feeding the Route Buffer via the RLU, the RE unit (Routing State Machine plus Routing Reservation Table with H slots), and the GMC handling grant signals]
Fig. 6.8 Input port of a 5-port router (for the static XY routing, the signal paths of the number of
used ID slots are removed)
The 'xy' multicast router uses 24 links in the NoC, while the 'plnr' and 'wf-v2'
multicast routers use only 19 and 21 link segments, respectively. The traffic metric
is defined as the number of communication links used by a multicast packet to travel
from the source node to the destination nodes. From this example it can be concluded
that the planar adaptive multicast router can potentially reduce the communication
energy of the multicast data transmission. The following experiment will show
the result of a more complex data distribution scenario.
[Fig. 6.9: routing paths from source node (2,2) to 10 destination nodes for the static XY tree-based, adaptive WF-v2 tree-based, and planar adaptive (plnr) multicast methods]
Fig. 6.9 The traffic patterns by using static tree-based, minimal adaptive west-first and minimal
planar adaptive multicast routing methods
The problem of an inefficient runtime spanning tree configuration can affect the
overall throughput of the generated multicast trees [24]. Because of this issue,
the adaptive routing cannot show better performance in every circumstance,
since the data rate of the multicast tree depends on the slowest data rate among all
spanning trees or branches of the multicast tree. Therefore, based on the presented
local ID management concept, an efficient method for runtime multicast spanning
tree configuration, using a minimal adaptive routing algorithm based on a so-
called pheromone tracking strategy, is proposed in this contribution. Minimizing the
size of the spanning trees (the total multicast communication traffic) will not only
reduce the communication energy but also decrease the probability of forming
spanning trees having slower data rates.
The routing engine (RE) units in XHiNoC consist of a combination of a Routing State
Machine (RSM) unit and a Routing Reservation Table (RRT) unit. The combination
is aimed at supporting runtime link interconnect configuration.
where k ∈ Ω = {0, 1, 2, ..., N_slot − 1} and N_slot is the number of ID slots on a
link; r_dir is a routing direction, where r_dir ∈ D = {1, 2, 3, ..., N_outp}, and
N_outp is the number of I/O ports in a router.
Definition 3. A Routing Reservation Table (RRT) of the RE unit at an input port of
a multicast router is defined as

T(k) = [T_mcs(k, 1)  T_mcs(k, 2)  ...  T_mcs(k, N_outp)]   (6.2)

T(k) is a vector of binary elements. Hence, T has a 2D (matrix) size of rows ×
columns = N_slot × N_outp.
Based on Definitions 2 and 3, a binary-encoded multicast routing direction
r_dir^bin = enc(r_dir) is introduced, which has N_outp binary elements.
For example, if N_outp = 5, then enc(1) = [1 0 0 0 0], enc(2) = [0 1 0 0 0],
enc(3) = [0 0 1 0 0], enc(4) = [0 0 0 1 0] and enc(5) = [0 0 0 0 1]. Algorithm 1
describes the ID-based routing organization between the RSM and the RRT unit. If
an RE unit identifies a flit F(type, I) as a header flit (type = header) at the output
of a FIFO buffer with local ID-tag I ∈ Γ, where Γ = Ω, then the routing function
f_RSM(A_dest) of the RSM unit determines a routing direction r_dir. This means
that r_dir is calculated logically based on the destination address A_dest attached
in the header flit and the current address of the router, and the routing direction is
written into slot number k = I of the RRT unit. In subsequent time periods, when
the RE unit identifies payload flits with the same ID-tag number I as the previously
forwarded header flit, their routing direction is taken directly from the
slot number k = I in the RRT unit.
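The interplay between the RSM and the RRT can be sketched behaviorally as follows (our Python illustration of the idea behind Algorithm 1, with a deliberately simplified placeholder routing function):

```python
class RoutingEngine:
    """Sketch of the RSM/RRT cooperation of an RE unit."""
    def __init__(self, n_slots):
        self.rrt = [None] * n_slots       # Routing Reservation Table

    def f_rsm(self, a_dest, a_curr):
        # placeholder for f_RSM(A_dest); any concrete routing rule fits here
        dx = a_dest[0] - a_curr[0]
        return "EAST" if dx > 0 else ("WEST" if dx < 0 else "LOCAL")

    def route(self, flit_type, id_tag, a_dest=None, a_curr=None):
        if flit_type == "header":
            r_dir = self.f_rsm(a_dest, a_curr)
            self.rrt[id_tag] = r_dir      # reserve slot k = I for the stream
        else:
            r_dir = self.rrt[id_tag]      # payload/tail flits follow slot I
        return r_dir

re_unit = RoutingEngine(n_slots=8)
print(re_unit.route("header", 1, a_dest=(3, 0), a_curr=(1, 0)))  # EAST
print(re_unit.route("payload", 1))                               # EAST
```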
[Fig. 6.10: two branches (A and B) of the same multicast message routed through routers R1–R4]
There are three main steps to send a multicast message towards multiple
destinations. The first step is to forward all header flits for the multicast tree routing
setup and ID-slot reservation. The second step is to multicast (replicate) the payload
flits, which follow the paths set up previously by the header flits. The last step is to
free the reserved local ID slots by means of the tail flit. The detailed procedure can
be found in [24] and is formally described in Algorithm 1.
Definition 4. A runtime tree-based multicast routing configuration of a message
that will be sent to N_dest multicast destinations is established by sending N_hf
header flits, where N_dest = N_hf. The multicast header flits H_j(I), j ∈
{1, 2, ..., N_hf}, can be ordered arbitrarily, where I is the ID-tag of the headers at a
certain (input) link n. Thus, we can further define that F_n(header, I) = H_j(I).
[Fig. 6.11 (a–d): four possible configurations in which tree branches A and B of the same multicast message meet again at router R4]
In the router R3 in Fig. 6.10, the multicast message is routed from the SOUTH
to the EAST direction (branch A), while in the router R2, the multicast message is
routed from the WEST to the NORTH direction (branch B). Finally, these two
branches are routed to the same router (router R4). In this case, the multicast tree
branches (spanning trees) are inefficient in terms of communication energy. The
communication energy could be reduced if the router R1 performed only one of the
multicast tree branches, A or B.
Postulate 2. If two header flits H_j(I) and H_k(I) having the same ID-tag I (hence
belonging to the same multicast message) are routed from the same input port n
in a router node (x, y) at two consecutive times t_Hj and t_Hk with t_Hj < t_Hk
(which means that H_j is routed before H_k), then an inefficient runtime spanning
tree configuration can occur when H_k (which will be routed to destination node
(x_k, y_k), where |x_offs,k| = |x_k − x| > 0 and |y_offs,k| = |y_k − y| > 0) does
not follow an output port m which has already been selected for the routing of H_j
(having destination node (x_j, y_j)), where x_j − x = x_offs,j = x_offs,k and
y_j − y = y_offs,j = y_offs,k, or x_offs,j = 0 and y_offs,j = y_offs,k, or
x_offs,j = x_offs,k and y_offs,j = 0.
Figure 6.11 depicts four possible situations that can occur in the router R4 as
further disadvantageous consequences of the inefficient multicast spanning trees
configured in Fig. 6.10. These situations can happen since the number of free ID
slots on each communication link, used as the parameter of the adaptive routing
algorithm, may change dynamically. Figure 6.11a, b show a tree-branch crossover
problem, in which the inefficient spanning trees are propagated through different
outgoing ports. If we assume that the current address of router R4 is (x_curr, y_curr)
and the target nodes of the tree branches A and B are (x_t1, y_t1) and (x_t2, y_t2)
such that x_offset1 = x_t1 − x_curr > 0 and y_offset1 = y_t1 − y_curr > 0 as well
as x_offset2 = x_t2 − x_curr > 0 and y_offset2 = y_t2 − y_curr > 0, then in any
circumstance, the inefficient situation might happen again in the next intermediate
nodes.
The problematic configurations presented in Figs. 6.10 and 6.11 are not only
inefficient in terms of communication energy (because the inefficient traffic will
overburden the NoC), but also in terms of communication latency, since the
inefficient traffic can degrade the data rate of the multicast traffic. These problems
reduce the NoC performance while increasing power consumption.
We solve the aforementioned problem not by designing a specific multicast
path optimization algorithm that would have to be run at compile time or before
injecting multicast messages (a pre-processing algorithm). Compile-time path
optimization algorithms, such as optimal spanning tree algorithms, are suitable for
source routing approaches, where the routing paths of a multicast message from the
source to the destination nodes are determined at the source node before the
message is injected into the network. In contrast, in XHiNoC the routing algorithm
used to route unicast and multicast messages is the same: the routing decisions are
made at runtime and executed locally, hop-by-hop, at every port of each router. Thus,
we do not follow the approach of a path pre-processing optimization algorithm, for
the sake of initiation-time efficiency.
In order to avoid the previously described problems, each time a routing engine
has two alternative output ports for making a routing decision, a new adaptive
selection strategy between the two alternative output ports is applied. A
simple abstract view of the adaptive selection strategy is outlined in Algorithm 2.
The basic concept of the proposed algorithm is the identification of the track records
(pheromone trails) of other previously-routed header flits that belong to the same
multicast message. This concept is designed to avoid inefficient spanning tree
branches of the multicast tree.
Definition 5. A pheromone trail check is an operation that checks the binary states
of the multicast routing slots T_mcs(k, m1) and T_mcs(k, m2). The operation is made
by a header flit H_j(I) having ID-tag number I = k that can alternatively be routed to
output directions m1 and m2, where m1, m2 ∈ D.
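A minimal sketch of this check (ours) follows; when no trail exists, it falls back to the free-ID-slot comparison that the text names as the adaptive routing parameter:

```python
def select_output(t_mcs, k, m1, m2, free_ids):
    """Pheromone trail check between two alternative output ports."""
    if t_mcs[k][m1]:                  # trail of a previous header on m1
        return m1
    if t_mcs[k][m2]:                  # trail of a previous header on m2
        return m2
    # no trail: choose the port offering more free ID slots
    return m1 if free_ids[m1] >= free_ids[m2] else m2

# t_mcs[k][m] = 1 if RRT slot k already reserves output m (cf. Eq. 6.2)
t_mcs = [[0] * 5 for _ in range(4)]
t_mcs[2][3] = 1                       # header 1 of the message took port 3
print(select_output(t_mcs, k=2, m1=1, m2=3, free_ids=[0, 5, 0, 2, 0]))  # 3
```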
The hardware implementation of the efficient adaptive routing selection function
with the pheromone tracking strategy does not result in a more complex operation in
the XHiNoC microarchitecture. The router complexity can be reduced naturally
[Fig. 6.12: pheromone tracking on a 2D planar router: header 2 of packet A follows the trail (used ID slots) left by header 1 instead of opening an alternative tree branch]
In this section, four XHiNoC multicast router prototypes with different multicast
routing algorithms are compared. The first prototype is a multicast router working
with a planar adaptive routing algorithm on a NoC mesh architecture, denoted by
the 'plnr' acronym in the figures. The second prototype applies a static XY multicast
routing algorithm on the standard mesh architecture ('xy'). The third and fourth
prototypes are multicast routers in a standard mesh architecture applying an
adaptive West-First (WF) routing algorithm ('wf-v1' and 'wf-v2'). The adaptive WF
multicast router version 1 ('wf-v1') is a multicast router without an adaptive
selection strategy to avoid inefficient spanning trees (branches of the multicast
tree), as presented in [24]. Thus, in this prototype, the multicast trees are formed
freely, without considering the track records of the other previously-routed header
flits belonging to the same multicast group. The adaptive WF multicast router
version 2 ('wf-v2') implements the adaptive selection strategy presented in
Algorithm 2 to avoid the inefficient spanning tree problem.
[Fig. 6.13: random multicast test scenario on the 8 × 8 mesh: each router node is annotated with its node number, the number of the node it receives a message from, and a U/M6/M8 tag if it is a unicast or multicast source]
Every NoC router in Fig. 6.13 is depicted as a square block with numerical
attributes. The numerical symbol in the small square at the top-left side of a
NoC router node represents the node number. The numerical symbol at the top-right
side of the NoC router node represents the communication partner of the node, i.e.
the node from which this NoC router will receive a message. For example, the
network node at node address (2, 1) (a 2D node address, where 2 is the x-horizontal
address and 1 is the y-vertical address) has node number 11. At the top-right side of
this node, we see the numerical value 31 (this means that the node will receive a
packet from mesh node 31, located at node address (6, 3)).
The boldface symbols (U, M6 and M8) at the bottom-left corner of a router
box indicate that the network node will send a unicast message (U) or a multicast
message with 6 target nodes (M6) or 8 target nodes (M8). For example,
the mesh node at address (7, 1) (mesh node number 16) is attributed with M8.
This means that the node will send a multicast message to 8 destination nodes.
We can find the target nodes of the multicast message sent from mesh node
number 16 by identifying the mesh nodes having the numerical symbol 16 at the
right side of their router boxes. In order to easily find the partners of each unicast
and multicast communication, Table 6.1 gives an overview of the unicast and multicast
communication partners/groups of the source-destination distribution presented in
Fig. 6.13.
Table 6.1 Unicast and multicast communications for the random multicast test traffic scenario

| Comm. group | Type | Source | Targ. 1 | Targ. 2 | Targ. 3 | Targ. 4 | Targ. 5 | Targ. 6 | Targ. 7 | Targ. 8 |
|-------------|------|--------|---------|---------|---------|---------|---------|---------|---------|---------|
| Comm. 1     | M6   | (1,4)  | (6,4)   | (7,3)   | (4,2)   | (3,0)   | (3,5)   | (4,7)   | –       | –       |
| Comm. 2     | M6   | (7,4)  | (0,4)   | (3,6)   | (2,6)   | (2,2)   | (1,1)   | (4,0)   | –       | –       |
| Comm. 3     | M6   | (0,3)  | (6,3)   | (3,2)   | (4,1)   | (7,4)   | (4,6)   | (3,7)   | –       | –       |
| Comm. 4     | M6   | (6,3)  | (0,3)   | (4,5)   | (3,1)   | (2,1)   | (1,0)   | (2,7)   | –       | –       |
| Comm. 5     | M6   | (1,7)  | (7,6)   | (5,5)   | (6,2)   | (6,0)   | (2,0)   | (0,1)   | –       | –       |
| Comm. 6     | M6   | (4,4)  | (1,4)   | (1,2)   | (7,1)   | (5,0)   | (6,5)   | (6,6)   | –       | –       |
| Comm. 7     | M8   | (1,0)  | (7,0)   | (6,1)   | (5,2)   | (2,4)   | (7,5)   | (6,7)   | (0,6)   | (1,7)   |
| Comm. 8     | M8   | (7,1)  | (2,3)   | (5,4)   | (2,5)   | (1,5)   | (5,6)   | (0,7)   | (7,7)   | (0,0)   |
| Comm. 9     | M8   | (7,6)  | (1,6)   | (0,5)   | (5,3)   | (1,3)   | (0,2)   | (5,1)   | (5,7)   | (7,2)   |
| Comm. 10    | U    | (0,0)  | (4,4)   | –       | –       | –       | –       | –       | –       | –       |
| Comm. 11    | U    | (0,7)  | (4,3)   | –       | –       | –       | –       | –       | –       | –       |
| Comm. 12    | U    | (7,0)  | (3,4)   | –       | –       | –       | –       | –       | –       | –       |
| Comm. 13    | U    | (7,7)  | (3,3)   | –       | –       | –       | –       | –       | –       | –       |
[Fig. 6.14: (a) average bandwidth and (b) tail flit acceptance latency (clock cycles) versus injection rate (flits/cycle/producer) for the plnr, xy, wf-v2 and wf-v1 prototypes]
Fig. 6.14 Average bandwidth and tail flit arrival latency measurement versus expected data
injection rates for multicast random test scenario. (a) Average bandwidth. (b) Average tail flit
latency
The measurements of the average bandwidth and tail flit acceptance latency for
various expected data injection rates are depicted in Fig. 6.14. The measurements are
made for five expected data injection rates, i.e. 0.1, 0.125, 0.2, 0.25 and 0.5
flits/cycle (fpc), where each source node injects a 5,000-flit packet (equivalent to
4 × 5,000 = 20 kB of data words). For all multicast routing algorithms, the
average BW increases as the injection rate is increased. However, the average BW
tends to saturate when the injection rate is set close to the maximum injection rate
(the maximum injection rate is 1 flit/cycle).
Table 6.2 Total performed traffic on each link direction for different tree-based static and adaptive multicast routing methods

| Routers | South2 | North2 | South | West | North | East | TOT. |
|---------|--------|--------|-------|------|-------|------|------|
| plnr    | 30     | 28     | 34    | 65   | 27    | 46   | 230  |
| xy      | –      | –      | 83    | 40   | 89    | 36   | 248  |
| wf-v2   | –      | –      | 79    | 40   | 80    | 46   | 245  |
| wf-v1   | –      | –      | 85    | 40   | 81    | 86   | 292  |
Based on measurements at a 1 GHz data frequency, each link has a maximum
bandwidth (BW) capacity of 2,000 MB/s. The number of clock cycles required
to transfer the 20 kB of data words to a target node j is measured by counting the
number of clock cycles until the tail flit of the packet (i.e. the 5,000th or last flit) is
received; this count is defined as N_TC^j. Since there are 64 target nodes in the
traffic scenario shown in Fig. 6.13, the average tail flit acceptance values plotted in
Fig. 6.14b are (1/64) Σ_{j=1}^{64} N_TC^j. The BW measured at a target node j is
calculated as B_j = (N_TC^j)^{-1} × 20,000 B; for instance, a node receiving its tail
flit after 20,000 cycles (20 μs at 1 GHz) attains B_j = 1 B/cycle = 1,000 MB/s.
Consequently, the average actual BW values plotted in Fig. 6.14a are
(1/64) Σ_{j=1}^{64} B_j. It can be seen that the tree-based multicast router with the
planar adaptive routing algorithm shows the best performance, both in terms of the
average bandwidth (Fig. 6.14a) and the average tail flit acceptance latency
(Fig. 6.14b), for all expected data rates. If the data injection rates are very low
(e.g. 0.1 fpc), then the performance of all multicast routers is the same. The
performance of the planar adaptive multicast router becomes significantly better
than that of the other multicast routers when the data is expected to be transmitted
at higher data rates.
Table 6.2 shows a comparison of the total performed communication traffic for
the four tree-based multicast router prototype scenarios. The traffic number
represents the number of communication resources (communication links) being
used to route the unicast/multicast messages from the source to the destination
nodes. The traffic-number metric is measured by counting the number of
used/reserved ID slots at all router output ports at peak performance. This
performance metric can also be used as a measurement unit for the communication
energy of the evaluated multicast routers. The metric is also interesting for
comparisons with other works in the future, since it is technology-independent.
As shown in Table 6.2, the planar adaptive multicast router ('plnr') consumes the
fewest communication resources, i.e. about 230 communication links, followed by
the west-first adaptive multicast router with the efficient spanning tree method
('wf-v2'), and then the static tree-based multicast router that uses the XY routing
algorithm ('xy').
[Fig. 6.15: two bar charts, one per node range, showing the reserved ID slots at each network node for the plnr, xy, wf-v2 and wf-v1 prototypes]
Fig. 6.15 Reserved (used) overall ID slots for the multicast random test scenario. (a) Nodes 1–32. (b) Nodes 33–64
In order to examine the traffic configuration in the network in detail, Fig. 6.15 shows
a 2D view of the total ID slot reservations in every NoC router node. Figure 6.15a
depicts the total reserved ID slots for NoC router nodes 1 to 32, while
Fig. 6.15b shows the total ID slot reservations for NoC router nodes 33 to
64. As depicted in Fig. 6.15, the west-first adaptive routing algorithm without
the efficient spanning tree method ('wf-v1') reserves more ID slots than the other
routing algorithms at several router nodes.
One basic class of methods for routing multicast messages in mesh-based NoCs are
tree-based multicast routing techniques. In tree-based multicast routing, an ordering
of the headers before submission of the packet at the source node is not required (the
order of the destination addresses can be freely determined). The multicast routing
forms communication paths like the branches of a tree, connecting the source node
with the destination nodes at the end points of the tree branches. The works in [4, 19]
and [14] have presented concepts and methodologies for routing multicast messages
using tree-based methods in the general internetworking context. The work in [25]
has presented a new theory for deadlock-free tree-based multicast routing in the
networks-on-chip area (mesh topology). The theory is developed based on a dynamic
local ID-tag routing organization and the concept of a hold-release tagging
mechanism.
Alternatively, multicast messages can also be routed by applying path-based
multicast routing methods. Here, as a prerequisite, an ordering of the multiple targets
is required before the multicast packets are sent into the network. Path-based
multicast routing requires the full implementation of an adaptive routing algorithm
allowing all turns in the mesh-based network topology. Therefore, virtual channels
are usually needed to make the multicast routing function deadlock-free. Virtual
channels in the context of on-chip interconnection networks consume not only
a larger logic gate area but also more power. The works in [9, 15, 16]
and [5] have presented path-based multicast routing methods for mesh-based
network topologies.
In the Network-on-Chip (NoC) research area, some multicast NoC architectures
have been introduced. Most of them use either the tree-based multicast routing
method [1, 11, 12, 28] or the path-based multicast routing method [8, 13, 18, 21].
The virtual circuit tree multicast (VCTM) NoC [12], for example, has presented a
NoC that uses virtual circuit tree numbers to configure routing paths. However,
compared to the presented approach, which uses runtime dynamic local ID
configuration, the VCTM NoC applies a static method, where virtual circuit tables
are statically partitioned among nodes.
A few NoC environments have also proposed specific multicast routing
methodologies, such as a closed-loop path routing method [17] and a region-based
routing method [21]. The recursive partitioning multicast (RPM) NoC [28], as
another example, applies a recursive hop-by-hop network partitioning method to
multicast packets at each intermediate node. The packets in the RPM NoC are
replicated at certain nodes to multicast packets. The replicated packets update the
destination list attached to their header flit and make a new network partitioning
recursively based on their current position. By using such a scheme, the RPM
method increases the complexity of the routing computation logic.
Table 6.3 presents several NoCs that propose and provide multicast routing
services for packet routing. Most of the NoC approaches route the network packets
Table 6.3 State of the art of multicast routing techniques for NoCs

| NoC | Multicast method | Switching method | Routing adaptivity | VC buffers (buffer depth) | Logic area (technology) | Specific features |
|-----|------------------|------------------|--------------------|---------------------------|-------------------------|-------------------|
| VCTM [12] | Tree-based | Circuit switching | Adaptive, static | yes; 4, 8 (d.n.a) | 0.0240^d mm² (70 nm) | Virtual circuit tree, static VC-table partitioning |
| RPM [28] | Tree-based | Wormhole | d.o. target distribution | yes; 4 (4) | 4.0^e mm² (65 nm) | Recursive partitioning, priority-based replication |
| LDPM [8] | Path-based | Wormhole | Odd-even adaptive | yes^a; 4 (8) | 21,050 gates (0.25 μm) | Path-based with optimized destination ordering |
| bLBDR [21] | Region-based | Wormhole | Static, adaptive ext. | d.n.a (d.n.a) | 0.0499 mm² (90 nm) | Traffic isolation, network domain partitioning |
| MRR [1] | Tree-based | Virtual cut-through | Adaptive | no (20, 20, 10)^b | d.n.a | Extra internal ring buffers (rotary router) |
| OPT, LXYROPT [11] | Tree-based | Circuit | Adaptive west-first | yes; 4 (3) | d.n.a | Pre-processing algorithm for tree generation |
| VC-A/D-FD [13] | Path-based | Wormhole, VCT | Static, adaptive | yes; 4 (8/10)^c | 1,172.03^f Mλ² (65 nm) | FIFO with address/data decoupling |
| Custom Mcast [29] | Tree-based | Packet | Static | no (4) | 0.18–3.06 mm² (70 nm) | Multicast routing at design time (static) |
| COMC [18] | Path-based | Wormhole | Static | yes; 4, 6 (2) | d.n.a | Connection-oriented path-based multicast |
| TDM-VCC [17] | Closed-loop | Circuit | Static | no (d.n.a) | d.n.a | Pre-processing algorithm for time slot allocation |

^f The area is for the adaptive router with wormhole switching (no further explanation about the λ unit is given)
by using the wormhole switching method. The table compares several aspects of
the router implementations and their specific features. Some information cannot be
provided in the table because the data is not available in the considered
publications in the bibliography. As shown in the table, compared to the other
NoCs, XHiNoC can be implemented with a very small buffer (a single buffer
with only two data slots per input port). The FIFO buffer is a NoC component that
can consume a relatively large logic area compared to other NoC components. It can
also have a large power dissipation due to the intensive switching activity of the data
in the buffer.
In Multiprocessor System-on-Chip (MPSoC) and Chip Multiprocessor (CMP) applications, multicast communication services are an essential issue. Recent works related to NoC-based multicast communication are presented in [29] and [17]. The work in [29] addresses the problem of synthesizing custom NoC architectures that are optimized for a given application (which is critical when dependability aspects are considered as well). The multicast method presented there considers both unicast and multicast traffic flows in the input specification. However, the work proposes a static solution for deadlock-free multicast routing that is fixed to specific NoC applications, i.e. the applications must be known before chip fabrication. The work presented in [17] proposes a TDM (Time Division Multiplex)-based virtual circuit configuration (TDM-VCC), where a pre-processing algorithm allocates time slots before multicast messages are injected into the NoC. In some specific embedded system applications, the inter-core communication patterns are known in advance. In such cases, pre-processed static routing with congestion-avoidance techniques can be used [20, 29], expectedly resulting in a much simpler router architecture (pre-manufacturing routing technique). Runtime dynamic adaptive routing methods [2, 3, 22] are, however, an interesting approach for NoC-based multicore embedded systems in which applications may not be known in advance and dependability and reconfiguration play a role. Indeed, some embedded IC vendors in the multicore era could potentially market not only IP cores but also complete system architectures [7], onto which many applications can be mapped (IP + NoC cores). In this context, the implementation of runtime dynamic adaptive tree-based multicast routing simplifies the embedded system design flow, because no routing configuration is needed on the post-manufacture (on-chip) router. The price for these runtime techniques is extra area cost and complexity.
6.8 Summary
In general, NoC routers applying planar adaptive routing schemes can achieve improved performance because of the higher bandwidth capacity provided by the doubled vertical links connecting the NORTH and SOUTH ports. Nevertheless, this performance gain must be paid for with the logic and routing area overhead of implementing the mesh planar router architecture.
The tree-based multicast routing presented in this chapter belongs to the class of runtime distributed routing techniques, in which routing decisions are made locally during application execution time (runtime) on every router/switch node based on a header's destination address. The advantage of this method is that it scales well with increasing NoC sizes. The presented technique for preventing inefficient spanning trees has proven to be feasible and to deliver good results. Compared to static methods it will generally result in a suboptimal or near-optimal multicast spanning tree, although in certain cases a globally optimal spanning tree may be attained.
If a static tree-based multicast routing were used, the configured multicast spanning trees would always be the same, even if the order of the header probes changed. For a fixed traffic scenario, the globally optimal multicast spanning tree could be attained by finding an optimum ordering of the header flits in one multicast message. The optimum ordering strongly depends on the multicast traffic patterns. Such a procedure would require running an optimum-ordering algorithm before injecting multicast packets at the source nodes. This additional effort would cost extra computational power and delay due to the pre-processing at the source nodes. Furthermore, the result could still be sub-optimal, since the traffic situation in the network can vary.
Exactly this issue is addressed by the presented approach, which considers local traffic situations dynamically and minimizes the communication resource and energy usage. When the presented adaptive routing algorithms are used, the spanning tree is formed independently at runtime by header probing and additional consideration of the local traffic situation. The spanning tree topology can vary with the order of header flits in the multicast message and with the nondeterministic, dynamically varying traffic loads in the network.
References
1. P. Abad, V. Puente, J.-A. Gregorio, MRR: enabling fully adaptive multicast routing for CMP
interconnection networks, in Proceedings of the 15th IEEE International Symposium on High
Performance Computer Architecture (HPCA 2009), Shanghai, 2009, pp. 355–366
2. M.A. Al Faruque, T. Ebi, J. Henkel, Run-time adaptive on-chip communication scheme, in
Proceedings of the 2007 IEEE/ACM International Conference on Computer-Aided Design
(ICCAD’07), San Jose (IEEE Press, Piscataway, 2007), pp. 26–31
3. G. Ascia, V. Catania, M. Palesi, D. Patti, Implementation and analysis of a new selection
strategy for adaptive routing in networks-on-chip. IEEE Trans. Comput. 57(6), 809–820 (2008)
4. M. Barnett, D.G. Payne, R.A. van de Geijn, J. Watts, Broadcasting on meshes with worm-hole
routing. J. Parallel Distrib. Comput. 35(2), 111–122 (1996)
5. R.V. Boppana, S. Chalasani, C.S. Raghavendra, Resource deadlocks and performance of
wormhole multicast routing algorithms. IEEE Trans. Parallel Distrib. Syst. 9(6), 535–549
(1998)
6. A.A. Chien, J.H. Kim, Planar adaptive routing: low-cost adaptive networks for multiprocessors,
in Proceedings of the 19th International Symposium on Computer Architecture, Gold Coast,
May 1992, pp. 268–277
27. F.A. Samman, T. Hollstein, M. Glesner, Planar adaptive network-on-chip supporting deadlock-
free and efficient tree-based multicast routing method. Elsevier Sci. J. Microprocess. Microsyst.
Embed. Hardw. Des. 36(6), 449–461 (2012)
28. L. Wang, Y. Jin, H. Kim, E.J. Kim, Recursive partitioning multicast: a bandwidth-efficient
routing for networks-on-chip, in Proceedings of the 3rd ACM/IEEE International Symposium
on Networks-on-Chip (NOCS’09), San Diego, 2009, pp. 64–73
29. S. Yan, B. Lin, Custom networks-on-chip architectures with multicast routing. Trans. Very
Larg. Scale Integr. Syst. 17(3), 342–355 (2009)
30. H. Zimmermann, OSI reference model – the ISO model of architecture for open systems interconnection. IEEE Trans. Commun. 28(4), 425–432 (1980)
Chapter 7
Path-Based Multicast Routing for 2D and 3D
Mesh Networks
7.1 Introduction
Fig. 7.1 [figure: multicast delivery in a 5 × 5 mesh with source and destination nodes marked; panels (a), (b), and (c) show the unicast-based, tree-based, and path-based schemes, respectively]
Delivering a multicast message by sending a separate unicast copy to every destination (Fig. 7.1a) is inefficient. This inefficiency arises from injecting multiple copies of the same message into the network: it not only results in a significant amount of traffic but also introduces a large serialization delay at the injection point.
The vast majority of traffic in Multi-Processor Systems-on-Chip (MPSoCs)
consists of unicast traffic and most studies have assumed that the traffic is only
unicast. Based on this assumption, the concept of unicast communication has been
extensively studied [1–4]. In these approaches, to support a multicast message
a single unicast message is delivered per destination. The unicast protocols are
efficient when all injected messages are unicast. However, if only a small percentage
of total traffic is multicast, the efficiency of the overall system is considerably
reduced. For example, let us assume that only 5 % of traffic consists of multicast messages, meaning that per 100 messages, 5 messages are multicast and 95 messages are unicast. For each unicast message, a single message is delivered into the network (i.e. 95 messages). However, if each multicast message has 10 destinations, 50 unicast messages must be sent into the network to support the multicast operation. With this simple calculation, we notice that the multicast operations generate more than half as many messages as the entire unicast traffic (50 messages for the multicast operations vs. 95 messages for unicast communication), i.e. roughly a third of the whole injected traffic stems from just 5 % multicast messages. Thus, efficient multicast support has a large impact on the performance of chip multi-processor systems. Multicast communication is frequently present in many cache coherency protocols (e.g. directory-based protocols, token-based protocols, and the Intel QPI protocol [5, 6]). For instance, around 5 % of the total traffic in the SGI Origin protocol (i.e. a directory-based protocol) consists of multicast messages. It should be taken into account that some cache coherence protocols are heavily multicast (e.g. the token-based MOESI).
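The arithmetic above is easy to reproduce; the following short Python sketch (parameter and function names are ours, not from the original text) computes the share of injected messages caused by multicast traffic under the unicast-emulation assumption:

```python
def multicast_message_share(total_msgs=100, multicast_frac=0.05, dests_per_mcast=10):
    """Share of injected messages caused by multicast traffic, assuming each
    multicast message is emulated by one unicast message per destination."""
    n_mcast = total_msgs * multicast_frac            # 5 multicast messages
    n_ucast = total_msgs - n_mcast                   # 95 unicast messages
    injected_for_mcast = n_mcast * dests_per_mcast   # 50 injected messages
    return injected_for_mcast / (injected_for_mcast + n_ucast)

print(multicast_message_share())  # ~0.345: about a third of all injected messages
```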
Hardware-based multicast schemes can be broadly classified into path-based
and tree-based methods. In the tree-based method, a spanning tree is built at the
source node and a single multicast message is sent down the tree (Fig. 7.1b).
The source node is considered as the root and destinations are the leaves of this
tree. The message is replicated along its route at intermediate nodes and for-
warded along multiple outgoing channels reaching disjoint subsets of destinations.
7 Path-Based Multicast Routing for 2D and 3D Mesh Networks 163
This replication at intermediate nodes may result in the blockage of messages [7, 8].
In the path-based method, a source node prepares a message for delivery to a set
of destinations. The message carries the destination addresses in the header. The
message is routed along the path until it reaches the first destination. The message
is delivered both to the local core and to the corresponding output channel for
continuing the path toward the next destination in the header. In this way, the
message is eventually delivered to all specified destinations (Fig. 7.1c). Notice that the path-based approach does not replicate the multicast message along the path, and thus it avoids the blocking issues that exist in the tree-based methods. However, the path visiting all destinations can become relatively long.
The path-based routing algorithms are commonly based on the Hamiltonian path, where an undirected Hamiltonian path is constructed in the network [10]. A Hamiltonian path visits every node in a graph exactly once. For each node in an a × b mesh network, a label L(x, y) is assigned as:

\[
L(x, y) =
\begin{cases}
a \cdot y + (x + 1) & \text{if } y \text{ is even} \\
a \cdot y + (a - x) & \text{if } y \text{ is odd}
\end{cases}
\]
where x and y are the coordinates of the node. Figure 7.2a shows the labeling
assignment based on this equation in a 3 × 4 mesh network.
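For illustration, this labeling transcribes directly into Python (a minimal sketch; the function name is ours):

```python
def label(x, y, a):
    """Hamiltonian-path label of node (x, y) in an a x b mesh (boustrophedon order)."""
    if y % 2 == 0:                  # even rows are numbered left to right
        return a * y + (x + 1)
    return a * y + (a - x)          # odd rows are numbered right to left

# Reproduces Fig. 7.2a for a 3 x 4 mesh, printed from the top row down:
for y in range(3, -1, -1):
    print([label(x, y, 3) for x in range(3)])   # [12,11,10] [7,8,9] [6,5,4] [1,2,3]
```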
As exhibited in Fig. 7.2b, c, two directed Hamiltonian paths (or two sub-
networks) are constructed by labeling the nodes. The high channel sub-network
(GH ) starts at 1 (Fig. 7.2b) and the low channel sub-network (GL ) ends at 1
(Fig. 7.2c). If the label of the destination node is greater than the label of the source
node, the routing always takes place in the high channel sub-network; otherwise, it
takes place in the low channel sub-network. The destinations are divided into two
groups. One group contains all the destinations that could be reached using GH and
the other contains the remaining destinations that could be reached using GL . To
reduce the path length, the vertical channels that are not part of the Hamiltonian
path (the dashed lines in Fig. 7.2) could be used in appropriate directions. In fact, because all messages in the high channel (low channel) sub-network follow paths in strictly ascending (descending) label order (using either solid or dashed lines), no cyclic dependency can be formed among the channels; thus the routing algorithm based on the Hamiltonian path is deadlock-free.

Fig. 7.2 (a) A 3 × 4 mesh physical network with the label assignment, (b) high channel sub-network, and (c) low channel sub-network [11]. Solid lines indicate the Hamiltonian path and dashed lines indicate the links that could be used to reduce the path length
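Building on the labeling above, the selection of a sub-network and the grouping of destinations can be sketched as follows (a hypothetical helper of ours, not the authors' implementation):

```python
def group_destinations(src_label, dest_labels):
    """Split destinations into the high (GH) and low (GL) channel sub-networks:
    labels greater than the source's are reached via GH, the rest via GL."""
    gh = sorted(d for d in dest_labels if d > src_label)                  # visited ascending
    gl = sorted((d for d in dest_labels if d < src_label), reverse=True)  # visited descending
    return gh, gl
```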
To reduce the path length, in Multi-Path (MP) partitioning, GH and GL are further partitioned [10]. GH is divided into two subsets (GH: GH1, GH2). If the source node is located in an odd row, GH1 covers the nodes whose x coordinates are smaller than that of the source node, and the other subset, GH2, contains the remaining nodes of GH (Fig. 7.4a). If the source node is located in an even row, GH1 covers the nodes whose x coordinates are smaller than or equal to that of the source node, and GH2 covers the rest of the destinations (Fig. 7.4b). GL is partitioned in a similar way into two subsets (GL: GL1, GL2). Hence, all destinations of a multicast message are grouped into four disjoint sub-networks. An example of MP is illustrated in Fig. 7.4b for a multicast message.
Fig. 7.6 Unicast and multicast routing (a) before applying HAMUM (b) after applying HAMUM
Fig. 7.8 An example of HAMUM for (a) unicast messages (b) multicast messages
Since the Odd-Even model [1] is one of the most popular adaptive routing
algorithms for unicast communication, we compare the adaptivity of HAMUM
with the Odd-Even model. Figure 7.9 shows all possible shortest paths based on
HAMUM taken by four messages in an 8 × 8 mesh network. All of the possible
routing paths for the Odd-Even model are indicated in Fig. 7.10.
In order to compare these two algorithms, we use the Degree of Adaptiveness (DoA) factor [1, 11], which is the number of minimal paths that can be taken by a message traversing from a source node (Xs, Ys) to a destination node (Xd, Yd). Assuming that Δx = Xd − Xs and Δy = Yd − Ys, the numbers of hops between the source and destination nodes along the two dimensions are dx = |Δx| and dy = |Δy|. The degree of adaptiveness of a fully adaptive algorithm is given by:

\[
\mathrm{DoA}(\text{fully adaptive routing})_{s,d} = \frac{(d_x + d_y)!}{d_x!\, d_y!}
\]
Fig. 7.10 All possible shortest paths from the source nodes 1, 8, 57, and 64 to the destination node 28 using Odd-Even
Table 7.1 Eight different states regarding the source and destination positions

State | Source row (odd/even) | Destination row (odd/even) | Destination position (left/right)
1 | Even | Even | Right
2 | Even | Odd | Right
3 | Even | Even | Left
4 | Even | Odd | Left
5 | Odd | Even | Right
6 | Odd | Odd | Right
7 | Odd | Even | Left
8 | Odd | Odd | Left
Based on the Hamiltonian Path, there can be eight different states considering
the source row (which can be even or odd), the destination row (that can be odd or
even), and the location of the destination node regarding the source node (left or
right side of the source node). These states are summarized in Table 7.1. Note that
when the source and destination nodes are located in the same dimension, only a
single path exists.
Fig. 7.11 [figure: adaptiveness states for an odd source row with Xd < Xs, shown for odd and even destination rows]
We compute the degree of adaptiveness of HAMUM in the high channel sub-network; a similar method can be applied to the low channel sub-network. As can be seen in Fig. 7.11, the degrees of adaptiveness of states 1 and 8 are equal and can be computed as:

\[
\mathrm{DoA}(1)_{s,d} = \frac{(d_x + D)!}{d_x!\, D!}, \quad \text{where } D = \frac{d_y}{2}
\]
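Both adaptiveness formulas can be checked with a few lines of Python (a sketch; we assume dy is even so that D = dy/2 is an integer, and we use the standard math.comb helper):

```python
from math import comb

def doa_fully_adaptive(dx, dy):
    # (dx + dy)! / (dx! * dy!) = C(dx + dy, dx)
    return comb(dx + dy, dx)

def doa_hamum_state1(dx, dy):
    # (dx + D)! / (dx! * D!) with D = dy / 2 (states 1 and 8)
    D = dy // 2
    return comb(dx + D, dx)

print(doa_fully_adaptive(3, 2))   # 10 shortest paths under fully adaptive routing
print(doa_hamum_state1(3, 2))     # 4 paths offered by HAMUM in state 1
```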
Fig. 7.12 The rules of odd-even turn model; prohibited turns in (a) even columns (b) odd columns
When the destination node is to the left side of the source node (Δx < 0), the following equation is obtained:

\[
\mathrm{DoA}(\Delta x < 0)_{s,d} =
\begin{cases}
\mathrm{DoA}(2)_{s,d} & \text{source: odd and destination: odd} \\
\mathrm{DoA}(1)_{s,d} & \text{otherwise}
\end{cases}
\]
Considering the above analysis, the degrees of adaptiveness of HAMUM and Odd-Even are close to each other. However, the Odd-Even model is designed for unicast communication and cannot be utilized for multicast traffic. HAMUM, on the other hand, is not only compatible with multicast traffic but also provides adaptivity for both unicast and multicast messages.
In this section, we extend the idea of the partitioning methods and the adaptive routing algorithm to a 3D mesh network. First, different partitioning schemes are introduced, called the Dual-Path (DP), Vertical-Path (VP), and Recursive Partitioning (RP) methods; then the minimal and adaptive routing algorithm (MAR) is proposed [12].
Fig. 7.14 (a) A 3 × 3 × 3 mesh network (b) high channel sub-network (c) low channel sub-
network
In a 3D mesh network, each node is likewise assigned a label based on its coordinates; one possible assignment, which we utilize, extends the 2D boustrophedon labeling layer by layer. The resulting node labels are shown in Fig. 7.14a. As exhibited in Fig. 7.14b, c, two directed Hamiltonian paths (or two sub-networks) are constructed by this labeling, similar to a 2D mesh network. If the label of the destination node is greater than the label of the source node, the routing always takes place in the high channel sub-network; otherwise it takes place in the low channel sub-network.
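For illustration, the following Python sketch computes a layer-by-layer snake labeling that reproduces the labels of Fig. 7.14a (the closed-form equations used by the authors may be written differently; the function names are ours):

```python
def snake_label_2d(x, y, a):
    """Boustrophedon label within one layer of width a (labels 1 .. a*b)."""
    return a * y + (x + 1) if y % 2 == 0 else a * y + (a - x)

def label_3d(x, y, z, a, b):
    """Layer-by-layer snake: odd layers reverse the in-layer order so that
    the path remains Hamiltonian when crossing between layers."""
    s = snake_label_2d(x, y, a)
    base = z * a * b
    return base + s if z % 2 == 0 else base + (a * b + 1 - s)

# Checks against Fig. 7.14a (3 x 3 x 3 mesh):
assert label_3d(2, 2, 0, 3, 3) == 9    # end of bottom layer
assert label_3d(2, 2, 1, 3, 3) == 10   # path continues straight up
assert label_3d(0, 0, 2, 3, 3) == 19   # start of top layer
```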
Figs. 7.15–7.17 [figures: GH and GL partitions of a 4 × 4 × 3 network under the DP, VP, and RP methods, annotated with per-partition node counts (e.g. GH with 22 nodes split into GH1 with 10 and GH2 with 12 nodes)]
The objective of the Recursive Partitioning (RP) method is to optimize the number of nodes included in each partition and thus to achieve better parallelism. In this method, the network is recursively partitioned until each partition contains at most k nodes. In the worst case, the network is partitioned into 2a vertical partitions, as in the VP method. We consider the value k as a reference value indicating the number of nodes in each partition of the VP method, i.e. k = bc in an a × b × c
network. An example of the RP approach is illustrated in Fig. 7.17a where a
multicast message is generated at the source node 26. The required steps of the
RP method can be expressed as follows:
Step1: The value k is set to 12 in a 4 × 4 × 3 network.
Step2: The network is divided into two partitions using the DP method. Figure 7.15a shows the two partitions formed when the source node is located at node 26.
Step3: If the number of nodes in a partition exceeds the reference value k, the
partition is divided into two new partitions. This step is repeated until all
partitions cover at most k nodes. Following the example of Fig. 7.15a, 22 nodes
are covered by the high channel sub-network which is greater than k = 12. The
high channel sub-network needs to be further divided into two new partitions
(GH1 and GH2 as shown in Fig. 7.17a). The GH1 and GH2 partitions contain 10
and 12 nodes, respectively. Since both numbers are less than or equal to k = 12,
no further partitioning is required for the high channel sub-network. The same
partitioning technique is applied to the low channel sub-network.
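A schematic rendering of this recursive splitting in Python (a sketch; the split helper stands in for the DP-style division, whose exact outcome depends on the source position):

```python
def recursive_partition(partition, k, split):
    """Recursively divide a partition until every piece holds at most k nodes.
    `split` divides one partition into two (DP-style division in the text)."""
    if len(partition) <= k:
        return [partition]
    left, right = split(partition)
    return (recursive_partition(left, k, split) +
            recursive_partition(right, k, split))

# Example with k = b*c = 12 in a 4 x 4 x 3 network: a 22-node partition is
# split exactly once (the text obtains sizes 10 and 12; a plain halving
# split yields 11 and 11 -- both satisfy the <= k condition).
halve = lambda p: (p[:len(p) // 2], p[len(p) // 2:])
print([len(p) for p in recursive_partition(list(range(22)), 12, halve)])  # [11, 11]
```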
Figure 7.17b shows another example of the RP method where the multicast
message is m = (7,{2,3,20,26,45}). In this example, three messages are formed and
their paths are PH2 = {7,10,11,12,13,20,45}, PH4 = {7,26}, and PL = {7,6,3,2}, and
the maximum latency is six hops. In brief, this scheme performs similarly to the VP method in avoiding long paths while providing better parallelism, since the numbers of nodes are comparable among partitions. With the RP approach, the creation of balanced partitions is less dependent on the source node position. Therefore, RP avoids long paths in the network and increases parallelism while keeping the number of startup messages relatively low.
We present a minimal and adaptive routing algorithm (MAR) based on the Hamil-
tonian path. Using MAR, unicast and multicast messages can be adaptively routed
inside the network. All routes used by the unicast messages are the shortest paths.
Although the overall path of a multicast message might be non-minimal, the paths
between each two destinations in the overall multicast path are minimal. Each node
in the graph has a label (L) determined by the Hamiltonian path labeling mechanism.
The MAR algorithm is implemented at the routing units and can be described in
three steps as follows:
Step1: it determines the neighbors of the node u that can be used to move a message
closer to its destination d. The pseudo code for Step1 is shown in Fig. 7.18.
Step2: since in the Hamiltonian path all nodes are visited in ascending order (high channel sub-network) or descending order (low channel sub-network), not all of the neighbors selected in Step1 necessarily satisfy the ordering constraint. Therefore, only those neighbors (from Step1) whose labels lie between the labels of the node u and the destination d can be selected as the next hop. The pseudo code for Step2 is shown in Fig. 7.18.
Step3: since the MAR algorithm provides several choices at each node, the goal of Step3 is to route a message through the less congested neighboring node. When the message can be forwarded through multiple neighboring nodes, the congestion values of the corresponding input buffers of the candidate neighbors are checked and the message is sent to the neighbor with the smallest stress value.
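The three steps can be condensed into the following sketch (Python; minimal_neighbors and stress are hypothetical helpers of ours, returning the neighbors on minimal paths for Step1 and the congestion value of a neighbor's input buffer for Step3):

```python
def mar_next_hop(u, d, labels, minimal_neighbors, stress):
    """MAR: pick the next hop toward destination d from the current node u."""
    lo, hi = sorted((labels[u], labels[d]))
    # Step 1: neighbors that bring the message closer to d (minimal paths only).
    candidates = minimal_neighbors(u, d)
    # Step 2: keep only neighbors whose labels respect the Hamiltonian ordering;
    # along the Hamiltonian path, at least one such neighbor always exists.
    ordered = [n for n in candidates if lo < labels[n] < hi]
    # Step 3: among the remaining candidates, choose the least congested one.
    return min(ordered, key=stress)
```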
An example of the MAR algorithm is illustrated in Fig. 7.19a where the source
and destination are located at the nodes 6 and 48, respectively. According to the
algorithm, in the first step the neighbors are chosen in a manner that gets the message
closer to its destination, i.e. n = {7,11,27}. In the second step, the selected neighbors
(in Step1) are checked to determine whether they are in the Hamiltonian path or not.
Since the labels of the three selected neighbors are between the labels of the current
node (u = 6) and the destination node (d = 48), the message can be routed via each
of them. Suppose that the neighbor p = 11 has the smallest congestion value; the algorithm then selects this neighbor to forward the message. If we continue with the node
u = 11, this node has three neighboring nodes belonging to the minimal paths, i.e.
n = {10,14,22}. However, only two of them (n = {14,22}) have the labels greater
than the label of the current node (u = 11) and lower than the label of the destination
node (d = 48).
Finally, according to the stress values of the input buffers, one of them is selected
as the next hop. The algorithm is repeated for the rest of the nodes until the message
reaches the final destination. It is worth noting that the stress value is updated
whenever a new flit enters or leaves the buffer (flit events: flit_tx or flit_rx). That
is, in each flit event, if the number of occupied cells of the input buffer is larger
(smaller) than a threshold value, the threshold signal is assigned to one (zero).
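The event-driven congestion flag described above can be written as (a one-line sketch; the name is ours):

```python
def update_threshold_signal(occupied_slots, threshold):
    """Evaluated on every flit event (flit_tx or flit_rx): raise the congestion
    flag when buffer occupancy exceeds the threshold, clear it otherwise."""
    return 1 if occupied_slots > threshold else 0
```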
Fig. 7.19 The MAR algorithm (a) for a unicast message (b) for a multicast message
The MAR algorithm can be adapted for multicast messages such that alternative
paths are used to route a message between the source node and the first destination
and also between successive destinations.
An example of MAR is shown in Fig. 7.19b where the source node (u = 6)
forwards a multicast message towards three destinations (D = {15,32,46}). The
MAR algorithm provides a set of alternative paths to send a message from the source
node to the first destination (d1 = 15). Similarly, the message can be adaptively
routed between each two destinations. For example, at the node 15, the message
can make progress towards the destination 32 either by selecting the node 18 in the
next layer or the node 16 in the current layer. MAR is compatible with all methods
supporting the Hamiltonian path. Therefore, all partitioning methods can utilize the
MAR algorithm for both unicast and multicast messages.
Dual-Path (DP), Multi-Path (MP), and Column-Path (CP) employing the HAMUM
routing algorithm are implemented in a 2D mesh network. In a 3D mesh network,
Dual-Path (DP), Vertical-Path (VP), and Recursive Partitioning (RP) methods along
with the MAR routing algorithm are implemented. We have developed a cycle
accurate wormhole-based 2D and 3D NoC simulator. The simulator inputs include
the array size, the operating frequency, the routing algorithm, the link width, and the
traffic type. Each switch in a 3D mesh network has 7 input/output ports, a natural extension of a 5-port 2D switch obtained by adding two ports for connections to the upper and lower layers [13, 14]. There are other types of 3D switches, such as the hybrid switch [13, 15] and MIRA [16]; however, since switch design is outside the scope of these methods, we chose a simple 7-port switch for our simulations.
Each input channel has a buffer (FIFO) size of 8 flits with the congestion
threshold set at 75 % of the total buffer capacity. For all nodes, the data width and the
frequency were set to 32 bits and 1 GHz, respectively, which led to a bandwidth of 32
Gb/s. The message size was assumed to be 16 flits. For the performance metric, we
use the multicast latency defined as the number of cycles between the initiation of
the multicast message operation and the time when the tail of the multicast message
reaches all the destinations. The preparation mechanism consists of partitioning the
destination set into appropriate subsets and creating multiple copies of the message.
All of these steps are performed at runtime.
The first set of simulations was done for a random traffic profile. In these simulations, each processing element generates data messages and injects them into the network with time intervals obtained from an exponential distribution. The mesh sizes are 8 × 8 and 4 × 4 × 4 for the 2D and 3D networks, respectively. In the multicast traffic profile, each processing element sends a message to a set of destinations. A uniform distribution was used to construct the destination set of each multicast message, and the number of destinations was set to 10.
The average communication delay as a function of the average flit injection rate in a 2D network is shown in Fig. 7.20a. According to the results, the proposed CP multicast routing algorithm leads to the lowest latency compared with the two other multicast approaches (DP and MP) and a simple unicast-based multicast scheme (UB). Figure 7.21a shows the performance gain of using HAMUM. As observed from the results, the performance of the MP and CP schemes improves when HAMUM is applied (AMP and ACP). This is due to the fact that HAMUM brings adaptivity to all unicast and multicast messages.
As can be seen from the results in Fig. 7.20b, the RP method achieves a lower delay than the DP and VP methods. The foremost reason for this performance gain is that the RP method reduces not only the number of hops for multicast messages but also the number of startup messages. In fact, the DP approach suffers from long paths, while the performance of the VP method degrades due to a large number of startup messages. Adaptive routing algorithms obtain better performance in congested networks by using alternative paths. In Fig. 7.21b, ARP (Adaptive RP), utilizing MAR in RP, and AVP (Adaptive VP), utilizing MAR in VP, are the adaptive models of RP and VP, respectively. The results show that adaptive routing becomes more advantageous as the injection rate increases.
Fig. 7.20b [plot: average latency (cycles) vs. message injection rate (messages/cycle) in a 3D network for DP-3D, VP-3D, and RP-3D]
Fig. 7.21b [plot: average latency (cycles) vs. message injection rate (messages/cycle) in a 3D network for VP-3D, AVP-3D, RP-3D, and ARP-3D]
To show the efficiency of the proposed model under the application traffic profiles,
traces from some benchmark suites selected from SPLASH-2 [17] and PARSEC
[18] are used. Traces are generated from SPLASH-2 and PARSEC using the GEMS
simulator [19]. We used the x264 application from PARSEC and the Radix, Ocean, and FFT applications from SPLASH-2 for our simulations.
Table 7.2 summarizes the full system configuration, where the cache coherence protocol is token-based MOESI and the access latency to the L2 cache is derived from CACTI [20]. It is noteworthy that the token-based MOESI protocol relies heavily on multicast traffic; according to our analysis, on average 80 % of the traffic in token-based MOESI is multicast. We form a 64-node on-chip network (i.e. 8 × 8 and 4 × 4 × 4 mesh networks). Out of the 64 nodes, 16 nodes are processors and the other 48 nodes are L2 caches. In a 2D network, the processors are located in the first and last rows. In a 3D network, the L2 caches are distributed in the bottom three layers, while all the processors are placed in the top layer close to a heat sink so that the
best heat dissipation capability is achieved [16, 21]. For the processors, we assume a
core similar to Sun Niagara and use SPARC ISA [22]. Each L2 cache core is 1 MB,
and thus, the total shared L2 cache is 48 MB. The memory hierarchy implemented
is governed by a 2-level directory cache coherence protocol. Each processor has a
private write-back L1 cache (split L1 I and D cache, 64 KB, 2-way, 3-cycle access).
The L2 cache is shared among all processors and split into banks (48 banks, 1 MB
each for a total of 48 MB, 6-cycle bank access). The L1/L2 block size is 64B. The
simulated memory hierarchy mimics SNUCA [23], while the off-chip memory is a 4 GB DRAM with a 260-cycle access time.
Figure 7.24a shows the average network latency of the real workload traces
collected from the aforementioned system configurations, normalized to MP in a 2D
network. As can be seen from this figure, the proposed adaptive model, HAMUM,
diminishes the average delay of MP and CP significantly under all benchmarks. That
is, adaptive routing has an opportunity to improve performance.
As can be seen from Fig. 7.24b in a 3D network, the recursive partitioning
method using MAR consistently reduces the average network latency across all
tested benchmarks.
Fig. 7.24 Performance evaluation under different application benchmarks (a) 2D network
(b) 3D network
All partitioning methods in a 3D network use the MAR routing algorithm. Therefore, the differences in the hardware overhead of the different methods stem from the partitioning methods and not from the routing units. While the same routing algorithm was used for the CP, MP, and DP multicasting schemes, different numbers of registers were employed in implementing their sorting mechanisms, leading to different area overheads.
To estimate the hardware cost of the proposed methods, the network area of each partitioning scheme, including switches and network interfaces, with the aforementioned configuration was synthesized with Synopsys Design Compiler using the UMC 90 nm technology. The frequency and the supply voltage are 1 GHz and 1 V, respectively. We performed place-and-route using Cadence Encounter to obtain precise power and area estimations. Based on our analysis, the area overheads of MP and CP are 3 % and 5 % higher than that of the baseline method, DP.
In a 3D network, depending on the technology and manufacturing process, the pitch of a TSV can range from 1 to 10 μm [24]. In this work, the pad size for TSVs is assumed to be 5 μm with a pitch of around 8 μm. The VP and RP schemes show 5 % and 6 % additional overhead over the area cost of DP, respectively. The area overheads of the routing algorithms, HAMUM in the 2D network and MAR in the 3D network, are negligible.
7.5 Conclusion
References
1. G.-M. Chiu, The odd-even turn model for adaptive routing. IEEE Trans. Parallel Distrib. Syst. 11(7), 729–738 (2000)
2. P. Lotfi-Kamran, A.M. Rahmani, M. Daneshtalab, A. Afzali-Kusha, Z. Navabi, EDXY – a low
cost congestion-aware routing algorithm for network-on-chips. J. Syst. Arch. 56(7), 256–264
(2010)
3. M. Ebrahimi, M. Daneshtalab, F. Farahnakian, J. Plosila, P. Liljeberg, M. Palesi, H. Tenhunen,
HARAQ: Congestion-aware learning model for highly adaptive routing algorithm in on-chip
networks, in Proceedings of International Symposium on Networks-on-Chip (Denmark, 2012),
pp. 19–26
4. M. Ebrahimi, M. Daneshtalab, P. Liljeberg, J. Plosila, H. Tenhunen, CATRA- congestion aware
trapezoid-based routing algorithm for on-chip networks, in Proceedings of Design, Automation
Test in Europe Conference Exhibition (DATE) (Germany, 2012), pp. 320–325
5. N. E. Jerger, L.-S. Peh, M. Lipasti, Virtual circuit tree multicasting: A case for on-chip
hardware multicast support, in Proceedings of the 35th Annual International Symposium on
Computer Architecture (ISCA), vol. 36 (China, 2008), pp. 229–240
6. P. Abad, V. Puente, J. Gregorio, MRR: Enabling fully adaptive multicast routing for CMP
interconnection networks, in Proceedings of IEEE 15th International Symposium on High
Performance Computer Architecture (HPCA) (USA, 2009), pp. 355–366.
7. J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks (Morgan Kaufmann, San Francisco,
2003)
8. R.V. Boppana, S. Chalasani, C.S. Raghavendra, Resource deadlocks and performance of wormhole multicast routing algorithms. IEEE Trans. Parallel Distrib. Syst. 9(6), 535–549 (1998)
9. M. Ebrahimi, M. Daneshtalab, P. Liljeberg, H. Tenhunen, HAMUM – A novel routing protocol
for unicast and multicast traffic in MPSoCs, in Proceedings of 18th Euromicro International
Conference on Parallel, Distributed and Network-Based Processing (PDP) (Italy, 2010),
pp. 525–532
10. X. Lin, P.K. McKinley, L.M. Ni, Deadlock-free multicast wormhole routing in 2D mesh multicomputers. IEEE Trans. Parallel Distrib. Syst. 5(8), 793–804 (1994)
Chapter 8
Fault-Tolerant Routing Algorithms in Networks-on-Chip
Abstract As the semiconductor industry advances to deep sub-micron and nano technology nodes, on-chip components become more prone to defects during manufacturing and to faults during the system lifetime. These components include Networks-on-Chip (NoCs), which are expected to be an important part of future complex multi-core and many-core chips. As a result, fault-tolerant techniques are essential to improve the yield of modern complex chips. In this chapter, we propose a fault-tolerant routing algorithm that keeps the negative effect of faulty components on NoC power and performance as low as possible. Targeting intermittent faults, we achieve fault tolerance by employing a simple and fast mechanism composed of two processes: NoC monitoring and route adaptation. The former keeps track of the on-chip traffic pattern and faulty links, whereas the latter adapts the packet paths to the current set of faulty components. This mechanism exploits global information about the state of the NoC components and the on-chip traffic pattern, and aims to minimize the performance loss and power overhead imposed by faulty NoC links and nodes. Experimental results show the effectiveness of the proposed technique, in that it offers lower average message latency, lower power consumption, and higher reliability compared to some state-of-the-art related work.
8.1 Introduction
Benefiting, with low overhead, from global information about the fault patterns and the communication demand of each node is the key advantage of our work over the few existing works that handle intermittent faults [6].
Although FaulToleReR targets intermittent faults, it can be used to route packets in any network with a dynamic topology (such as wireless networks), as well as in irregular mesh networks with oversized IPs (OIPs) [11].
The rest of the chapter is organized as follows. Section 8.2 gives an overview
of the entire problem and the proposed solution. Section 8.3 presents the imple-
mentation details of the proposed fault-tolerant routing algorithm. The experimental
results are presented in Sect. 8.4, and finally Sect. 8.5 concludes the chapter.
The first type of transaction, broadcasting a request or data from the root to the network, is performed by the root node by sending the request along its row in both the E and W directions. Each control network node in the same row as the root node, upon detecting the message, sends it to the nodes in its column along both the N and S directions. For the second type of transaction, which carries information to the root, the control network is divided into four quadrants with respect to the root node position; the nodes inside each quadrant direct their messages to the node in the row or column in which the root node is located, and the node in that row/column then forwards the message to the root node. Each node accepts the message of its previous node only when it has finished sending its own message to the next node along the assigned path. This scheduling removes the need for buffering and complex arbitration on the control network. The proposed scheduling schemes work correctly for any given location of the root node, but it is more beneficial to map the root node onto one of the central nodes, since this allows more parallelization in data transmission.
Here, BW(l_k) characterizes the bandwidth of link l_k, and X_k(i,j) is calculated as:

\[
X_k(i,j) =
\begin{cases}
t(e_{i,j}) & \text{if } l_k \in path(e_{i,j}) \\
0 & \text{otherwise}
\end{cases}
\tag{8.2}
\]

The parameter path(e_{i,j}) is the set of links onto which CTG edge e_{i,j} is mapped. This can be stated more simply as:

\[
BR(l_k) \leq BW(l_k)
\tag{8.3}
\]

where \( BR(l_k) = \sum_{e_{i,j} \in E} X_k(i,j) \) is the total amount of traffic that travels on NoC link l_k.
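In code form, the per-link bookkeeping behind Eqs. 8.2 and 8.3 might look as follows (a minimal sketch under our own data-structure assumptions: flows maps a CTG edge (i, j) to its volume t, and path maps an edge to the set of links its route uses):

```python
def link_load(link, flows, path):
    """BR(l_k): total traffic volume of all CTG edges whose path uses `link`."""
    return sum(t for edge, t in flows.items() if link in path[edge])

def bandwidth_ok(link, flows, path, bw):
    """Eq. 8.3: the aggregated traffic on a link must not exceed its bandwidth."""
    return link_load(link, flows, path) <= bw[link]
```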
Formally, the problem of dynamic route calculation can be described as follows. Given an NoC, which is obtained by removing the faulty links (FL members) from the total set of NoC links, and the NoC traffic pattern described by a communication task graph CTG, find a new route for each flow (each edge e_{i,j} in CTG) that minimizes the following expression, where route(i,j) is the distance (in terms of hop count) between nodes i and j:

\[
\sum_{e_{i,j} \in E} t(e_{i,j}) \cdot route(i,j)
\tag{8.4}
\]
This expression can be regarded as a measure of the average message latency of the network, as the latency of a packet is the sum of its zero-load latency and its blocking latency. The zero-load latency is the time (in cycles) it takes a packet to traverse from the source to the destination node when there is no contention and all resources are available. In reality, however, a packet must compete with other packets for the NoC bandwidth, and this competition causes blocking latency in the switches. If the network is not congested, the packet's latency is roughly proportional to the number of hops it traverses to reach the destination node. Therefore, minimizing Eq. 8.4 leads to minimizing the average message latency in the NoC. The power consumed to transmit packets is also directly proportional to the packet path length. Consequently, by specifying the packet routes in such a way that each packet is routed through a minimal path (subject to the constraints of the new topology) while the bandwidth constraint of each link is not violated, we can guarantee that the power consumption and packet latency of the NoC are kept as low as possible. In the next section, we show how the FaulToleReR algorithm achieves this goal.
The FaulToleReR algorithm is capable of rerouting packets upon both topology and traffic changes through a two-step routing procedure. The first, initial step finds all available paths for each flow of the CTG. These preliminary paths are used in the second step to find the final possible paths for each source-destination pair, in order to minimize Eq. 8.4. FaulToleReR needs no virtual channels for deadlock handling, but it can use virtual channels to improve performance. The algorithm is outlined in Fig. 8.1.
The Initiation function launches when the system starts up and whenever the on-chip traffic (represented by the CTG) changes. This function consists of two sub-functions, Sort and Find_All_Paths. The Sort function is a modified version of bubble sort which sorts the flows according to a predefined priority. As we aim to optimize power and latency, the priority criterion is the communication volume of the flows; hence, heavier flows have higher priority. The sorted flows and dis_tsh, a path length threshold, are passed as input to the Find_All_Paths function, which finds all possible paths for each flow that are shorter than dis_tsh (in terms of hop count). This function is a customized version of Dijkstra's algorithm with the modifications proposed in [12]. The customization significantly reduces the computations of the algorithm, making it suitable for online execution. We refer interested readers to [12] for more details. The output of this sub-function is a hash set of the paths, indexed by the links they include. Once this step is done, the basic step of the algorithm begins.
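For illustration, the Initiation step can be sketched as follows (Python; all_paths_shorter_than stands in for the customized Dijkstra variant of [12] and, like the flow attributes, is an assumption of ours):

```python
def initiation(flows, all_paths_shorter_than, dis_tsh):
    """Sort flows by communication volume (heaviest first) and collect,
    for each flow, every path shorter than dis_tsh hops."""
    ordered = sorted(flows, key=lambda f: f.volume, reverse=True)
    # Dict insertion order preserves the priority ordering of the flows.
    return {f: all_paths_shorter_than(f.src, f.dst, dis_tsh) for f in ordered}
```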
After the initial process, the algorithm enters the main function to find an appropriate route based on the current network configuration. This function is triggered either at system start-up or upon a fault occurrence. First, it checks for a topology change or fault occurrence. If this is not the first time the function launches, the set of indexed paths is pruned by removing the faulty links.
Before explaining the rest of the algorithm, it is worth bearing in mind that we require a path for each flow that satisfies the following conditions:
1. The set of all found paths should minimize Eq. 8.4 in order to reach the minimum average latency across all communication flows. The minimum value of Eq. 8.4 arises when each packet reaches its destination via a shortest path. However, this may not be possible due to faulty links, deadlock occurrence, or the constraints imposed by Eq. 8.3. Therefore, some of the flows must detour from their shortest paths.
If the shortest path of a flow and the actual path assigned to it are represented by route*(i,j) and route(i,j), respectively, we define the detour overhead parameter of a flow as:

\[
d_{f_{i,j}} = t(e_{i,j}) \cdot \left( route(i,j) - route^{*}(i,j) \right)
\tag{8.5}
\]

Since the overall detour overhead is the difference between the minimum attainable value of Eq. 8.4 and the value obtained by the actual routing, minimizing Eq. 8.4 is equivalent to minimizing the cumulative detour overhead of the flows, defined as (see the sketch after this list):

\[
D = \sum_{\forall e_{i,j} \in E} d_{f_{i,j}}
\tag{8.6}
\]
2. The overall traffic passing over a link should not exceed the maximum available link bandwidth, to avoid congestion (Eq. 8.3). We will later see that this constraint leads to better NoC performance and increases the network's tolerance against faults. If the total communication volume of two or more flows mapped onto the same link violates this constraint, an unacceptable situation called overlapping occurs.
3. The routing configuration should be deadlock and livelock free. Many algorithms do not engage with the deadlock problem and use virtual channels to escape deadlock scenarios. FaulToleReR, however, avoids deadlocks during path selection to relax the VC count constraint experienced by many routing algorithms.
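As promised above, the cumulative detour overhead of a candidate routing can be computed as in the following minimal sketch (Python, under our own assumption that all dictionaries are keyed by CTG edges):

```python
def cumulative_detour(flows, route_len, shortest_len):
    """D = sum over flows of t(e_ij) * (route(i,j) - route*(i,j)); Eqs. 8.5-8.6.
    `flows` maps edge -> volume t; `route_len`/`shortest_len` map edge -> hops."""
    return sum(t * (route_len[e] - shortest_len[e]) for e, t in flows.items())
```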
Back in the algorithm, the Get_Minimal_Paths function, which is the heuristic part of the FaulToleReR algorithm, aims at meeting the first condition mentioned above. This function starts from the minimum possible value for D, the overall detour overhead, and returns the set of paths by which this value is achieved (the minimum value is 0 when each flow is directed through its shortest path). If the set of selected paths is approved by the Link_Load and Deadlock_Free functions, which check whether the candidate set meets the second and third conditions, respectively, the goal is achieved; otherwise, the search continues with the next larger value of D.
Fig. 8.2 [figure: a 5 × 5 mesh example with candidate routing configurations and their detour overheads, e.g. (6,2,2) → 180, (4,2,6) → 200, (4,6,2)* → 320]
To check the third condition, the Deadlock_Free function builds the channel dependency graph (CDG) of the selected paths and checks whether it is acyclic. If no cycle is found, the routing configuration is certainly deadlock free.
FaulToleReR is also livelock free; in other words, it definitely finds a route for each flow to its destination as long as the network is not disconnected. As mentioned before, we place a bandwidth occupancy constraint on each link to avoid network saturation. Consequently, we extend the meaning of disconnection: two nodes are disconnected not only when there is no physical path between them, but also when using the possible paths between them would violate the bandwidth constraint of some links.
Our proposed algorithm applies a kind of fairness in routing, so that flows with lower priority are not always sacrificed for those with higher priority. Intuitively, FaulToleReR would seem to detour a flow with lower communication volume in favor of an overlapping flow with higher communication volume. However, this is not the best strategy in all cases. For example, consider the network configuration shown in Fig. 8.3. There are three flows with communication volumes of 90, 70, and 40 Mbps, and the maximum link bandwidth is 100 Mbps. As there are common links between the shortest paths of the flows and the total link load cannot exceed the bandwidth limitation, some flows must detour from their shortest paths. Figure 8.3a illustrates a case without fairness, where the two lower-volume flows detour from their shortest paths to secure a shortest path for the flow with the highest volume. Consequently, according to Eq. 8.6, the total detour overhead is 220 (140 for the 70 Mbps flow and 80 for the 40 Mbps flow). On the other hand, the routing configuration shown in Fig. 8.3b, which is obtained by misrouting the largest flow, has a total detour overhead of 180, which is obviously preferable to that of Fig. 8.3a.

Fig. 8.3 The impact of fairness on detour overhead in the proposed routing. Routing three flows (a) without fairness and (b) with fairness
Fig. 8.4 Average message latency comparison between FaulToleReR, Planar, and a highly
resilient algorithm for a given number of faults with (a) GSM, (b) MMS, and (c) a uniform input
traffic
We further compare FaulToleReR with the two other mentioned routing algorithms
in terms of power consumption. As FaulToleReR forwards packets via shorter paths
than the other algorithms, packets pass fewer intermediate routers. Hence we expect
our proposed algorithm to consume less average power than the others. The total power consumed by a packet is the sum of the power it consumes in each node for writing/reading the packet to/from the buffer, crossbar traversal, link traversal, route computation, virtual channel arbitration, and switch arbitration.
Figure 8.5 displays the power consumption of the FaulToleReR and Planar routing algorithms in a 6 × 6 mesh under the GSM traffic. The figure shows that FaulToleReR offers lower power consumption when there are fewer than 27 faulty links. As Fig. 8.4a shows, under the planar routing algorithm the network enters the saturation state in the presence of 22 or more faults. From this point on, FaulToleReR consumes more power because it is still working normally, while under the other algorithms the NoC traffic acceptance rate decreases significantly, which limits the power consumption of the NoC. As a result, the power consumption of FaulToleReR is higher than that of the other algorithms after the saturation point.
We then calculate the power-delay product (PDP) of the three NoCs to give a better understanding of the behavior of the routing algorithms in dealing with faults. Figure 8.6 shows the PDP for a given number of faults and traffic for FaulToleReR, Planar, and HRA. Following the same trend as the latency results, our algorithm outperforms the other considered algorithms in terms of PDP under all benchmarks. In Fig. 8.6, we have excluded static power from the results, as it is not affected by the routing algorithm.
Fig. 8.6 Power-delay product comparison between FaulToleReR, Planar, and HRA for a given
number of faults with (a) GSM, (b) MMS, and (c) a uniform input traffic
Figure 8.7 shows the reliability of FaulToleReR, Planar, and HRA for different NoC sizes and numbers of faults. HRA is 90 % reliable when 10 % of the links are broken, while Planar and FaulToleReR retain this reliability until 37 % and 54 % of the links are broken, respectively.
Fig. 8.7 Reliability of FaulToleReR, planar, and HRA versus number of faults
In the final experiment, we study the impact of the fault location on latency. We compare FaulToleReR with the reconfigurable routing algorithm presented in [19], which tolerates single-node failures. The authors in [19] considered nine contours (four at the corners, four between each two corners, and one in the middle) in a 2D mesh network where a single-node failure can occur.
Figure 8.8 depicts the packet latency of a 6 × 6 mesh (for a given traffic load in packets/node/cycle) for four contours. According to the results shown in Fig. 8.8b, the routing scheme proposed in [19] uses an X-Y-like routing with local fault information; hence, it reaches its saturation point sooner than FaulToleReR, whose behavior is shown in Fig. 8.8a. Since the middle routers of the mesh (those not placed at the network boundaries) play a more significant role in packet routing, a node failure in contour 5 leads to earlier network saturation.
Fig. 8.8 The impact of single-node fault location and traffic load (packet/node/cycle) on (a) FaulToleReR and (b) the algorithm proposed in [19]

8.5 Conclusion

References
Chapter 9
Reliable and Adaptive Routing Algorithms for 2D and 3D Networks-on-Chip

Masoumeh Ebrahimi
University of Turku, Turku, Finland
e-mail: masebr@utu.fi
Abstract Faults may have undesirable effects on the correct operation of a system, or at least on its performance. An NoC inherently has the potential to be a more reliable infrastructure than buses by providing alternative paths between each pair of source and destination routers. However, this potential cannot be exploited without the support of fault-tolerant routing algorithms. In this chapter, we take a detailed view of implementing high-performance fault-tolerant routing algorithms in 2D and 3D mesh networks. The required turn models are discussed and all fault conditions are investigated. As faults may occur in both links and routers, this chapter investigates both types of faults. Unlike traditional methods, in which the performance degrades significantly in faulty situations, the proposed fault-tolerant routing algorithms perform well in maintaining the performance of a faulty network. This is achieved by using the shortest paths as long as possible to bypass faults, while non-minimal paths are used only when necessary. The proposed methods can be adjusted to balance between reliability and performance. This chapter provides extensive knowledge for developing a fault-tolerant routing algorithm based on the characteristics of an NoC.
9.1 Introduction
Packets must be rerouted around a fault when the source and destination are located along the same dimension and there is a fault between them. We will prove that minimal and non-minimal routing can be used by packets without creating any cycle in the network, and thus the routing algorithms are deadlock-free.
Several methods have been presented in the realm of 2D NoCs to balance the traffic load over the network. DyXY [13] is a fully adaptive routing algorithm using one and two virtual channels along the X and Y dimensions, respectively. There are few partially or fully adaptive algorithms for 3D mesh networks. MAR [14] is a partially adaptive routing algorithm for 3D NoCs based on the Hamiltonian path; it is a simple approach providing adaptivity without using virtual channels. A fully adaptive routing algorithm for 3D mesh networks, called DyXYZ, is presented in [15]. Using this algorithm, packets are able to take any shortest path between the source and destination routers. DyXYZ requires two, four, and four virtual channels along the X, Y, and Z dimensions, respectively, to provide full adaptiveness.
A number of studies have presented solutions to tolerate faulty links or routers in a 2D mesh network. The method presented in [16] can tolerate a large number of faults without using virtual channels; however, this approach relies on a routing table at each router and an offline process to fill the tables. The algorithm presented in [17] does not require any routing tables, but packets may take unnecessary non-minimal paths. In this algorithm, an output hierarchy is defined for each position in the network; according to the positions of the current and destination routers, the routing algorithm scans the hierarchy in descending order and selects the highest-priority direction that is not faulty. BFT-NoC [18] presents a different perspective on tolerating faulty links: it tries to maintain the connectivity between routers through dynamic sharing of the surviving channels. Zhen Zhang et al. present an approach [19] to tolerate a single faulty router in the network without using virtual channels. The main idea of this algorithm is to reroute packets through a cycle-free contour surrounding the faulty router; each router must be informed about the faulty status of its eight direct and indirect neighboring routers. The DBP method [20] uses a lightweight approach to maintain the network connectivity among non-faulty routers. In this method, besides the underlying interconnection infrastructure, all routers are connected to each other via an embedded unidirectional cycle (e.g. a Hamiltonian cycle or a ring along a spanning tree), and a default back-up path is used at each router to connect the upstream to the downstream router. All of the mentioned algorithms may take unnecessary non-minimal routes to tolerate faults, which increases the latency of packets significantly. Four high-performance fault-tolerant approaches based on using the shortest paths have been presented for 2D mesh networks: two of them tolerate faulty links (MD [21] and MAFA [22]) and the two others tolerate faulty routers (HiPFaR [23] and MiCoF [24]). This chapter is mainly based on these works.
There are two types of complete cycles that can be formed in the network, known
as clockwise and counter-clockwise (Fig. 9.1a). The creation of cycles may lead to
Fig. 9.1 (a) Clockwise and counter-clockwise turns; (b) permitted and prohibited turns in the XY routing algorithm (Note that dashed lines indicate prohibited turns)
deadlock in the network, and thus they should be avoided. In turn models, certain turns are prohibited from each cycle in order to break all cyclic dependencies and thus avoid deadlock. In the XY routing algorithm, for example, packets are routed along the X dimension before proceeding along the Y dimension. As shown in Fig. 9.1b, in this algorithm two turns are removed from each abstract cycle, and thus there is no possibility of forming a complete cycle among the remaining turns.
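To make the turn-model discussion concrete, here is a minimal sketch of XY routing in Python; the coordinate convention and direction names are illustrative rather than taken from the chapter.

    def xy_route(cur, dst):
        """Dimension-order (XY) routing: resolve the X offset first, then Y.

        Because packets never turn from Y back to X, the turns removed from
        each abstract cycle (Fig. 9.1b) are never taken, so no cycle of
        channel dependencies -- and hence no deadlock -- can form.
        """
        cx, cy = cur
        dx, dy = dst
        if cx < dx:
            return "E"   # move east until the X offset is zero
        if cx > dx:
            return "W"
        if cy < dy:
            return "N"   # only then move along Y
        if cy > dy:
            return "S"
        return "LOCAL"   # arrived: eject to the local node

    # e.g., routing from (0, 0) to (2, 3) yields E, E, N, N, N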
We utilize one and two virtual channels along the X and Y dimensions, respectively, with which four cycles might be formed in the network (Fig. 9.2a). In
order to avoid deadlock, one turn is prohibited from each cycle which is shown in
Fig. 9.2b. The prohibited turns in each virtual channel are taken from the Mad-y
method [28]. Based on this turn model, the turns E-N1, E-S1, N1-E, N2-E, S1-E,
and S2-E are permitted for eastward packets while the turns W-S1, W-S2, W-N1,
W-N2, N2-W, and S2-W are allowable for westward packets.
To prove deadlock freedom, we use a numbering mechanism similar to the Mad-y method [28]. This numbering mechanism shows that all turns occur only in ascending order, and thus no cycle can be formed in the network. A two-digit number (a,b) is assigned to each output channel of a router in an n × m mesh network. According to the numbering mechanism, a turn connecting the input channel (Ia,Ib) to the output channel (Oa,Ob) is called an ascending turn when (Oa > Ia) or ((Oa = Ia) and (Ob > Ib)). Figure 9.3 shows the channels' numbering.
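The ascending-turn test itself is simple to state in code. The following sketch assumes the two-digit channel labels have already been computed; only the comparison rule comes from the text.

    def is_ascending(in_ch, out_ch):
        """A turn from input channel (Ia, Ib) to output channel (Oa, Ob) is
        ascending when Oa > Ia, or when Oa == Ia and Ob > Ib."""
        ia, ib = in_ch
        oa, ob = out_ch
        return oa > ia or (oa == ia and ob > ib)

    # If every permitted turn is ascending, a cycle of turns would need the
    # channel numbers to increase forever, which is impossible in a finite
    # mesh -- hence the turn model is deadlock-free.
    assert is_ascending((1, 3), (2, 0))      # Oa > Ia
    assert is_ascending((2, 1), (2, 4))      # Oa == Ia, Ob > Ib
    assert not is_ascending((3, 2), (3, 2))  # no progress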
Fault-tolerant routing algorithms are usually very complex. In contrast, the algorithms proposed in this chapter are very simple and can be easily implemented. The basic idea behind these algorithms is to keep the adaptivity of packets as long as possible. This means that if, for example, there are two minimal directions to deliver a packet, the preferred direction is the one from whose next router the packet still has some alternative paths to reach the destination. This simple idea avoids taking unnecessary non-minimal paths and reduces the complexity of the fault-tolerant algorithms.
Let us consider the examples of Fig. 9.4, where a packet is sent from the current router C to the destination router D. In Fig. 9.4a, the packet can be sent through two minimal directions (i.e., E and N) to reach the destination router D. By sending the packet in either direction, the packet will have two minimal choices from the next neighboring router to reach the destination. Therefore, if one path is faulty, there is an opportunity to send the packet through the other path. In Fig. 9.4b, however, if the packet is sent to the north neighboring router, it will have only one option to reach the destination router. So it is better to send the packet to the east direction, from where there are two minimal paths toward the destination. In Fig. 9.4c, there are one and two minimal paths from the east and north neighboring routers, respectively, to reach the destination router. Thereby, the north direction is the better choice to deliver the packet. In Fig. 9.4d, there is only one possible choice to reach the destination router from either the east or north neighboring router (as
Fig. 9.4 The basic idea of the RR-2D method (Note that numbers determine the priority of
selecting among different routes)
we will see in Sect. 9.3.3, to distribute less fault information, the possibility of the
north direction is checked earlier than the east direction). In Fig. 9.4e, f, only one
minimal route exists to reach the destination and the packet has to be routed through
this single path. In this case if there is a fault in the path, the packet has to take
a non-minimal route to bypass the fault. In fact, by making decisions as in Fig. 9.4b, c, packets do not lose their adaptivity and thus will not face situations similar to those in Fig. 9.4e, f.
In general, packets are routed inside the network using the permitted turns offered by a fully adaptive routing algorithm. The adaptivity is limited when the distance between the current and destination routers reaches one along at least one dimension. RR-2D avoids reducing the distance to zero along one dimension while the distance along the other dimension is greater than one.
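As a rough illustration of this rule, the following sketch chooses a direction for a northeastward packet so that neither offset is driven to zero prematurely; the function name and coordinate convention are ours, not the chapter's.

    def prefer_direction(cur, dst):
        """For a northeastward packet, prefer the direction that keeps the
        remaining offsets larger than one (i.e., preserves adaptivity)."""
        dx = dst[0] - cur[0]   # remaining eastward hops
        dy = dst[1] - cur[1]   # remaining northward hops
        if dx > 1 and dy == 1:
            return "E"         # Fig. 9.4b: going north first leaves one path
        if dy > 1 and dx == 1:
            return "N"         # Fig. 9.4c: going east first leaves one path
        if dx == 0:
            return "N"         # single minimal path (Fig. 9.4e, f)
        if dy == 0:
            return "E"
        return "N or E"        # both choices keep adaptivity (Fig. 9.4a, d)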
Fig. 9.5 Fault-information wiring between the current router C and its neighboring routers (1-bit and 3-bit connections)
Each router is aware of the fault status of its instant links, but it should be informed about the fault information of the other links and routers. As shown in Fig. 9.5c, a 3-bit wire is used to transfer the fault information of the north neighboring router and its east and west links to the current router. Similarly, a 3-bit wire is utilized to transfer the information from the south direction. A 1-bit wire is enough to transfer the information of the east and west neighboring routers to the current router.
In this section, we investigate how faults are tolerated using RR-2D. The simple rule is to check the possibility of sending a packet through the greater-distance dimension when the distance to the destination router reaches one along a dimension. The packet is sent to the greater-distance dimension if the instant link and router along this direction are non-faulty; otherwise the smaller-distance dimension is examined. In other situations (i.e., when the distance has not reached one along any dimension), packets have no routing limitation. The different positions of a faulty link or router for a northeastward packet are illustrated in Fig. 9.6.
As shown in Fig. 9.6a, when the distance along both the X and Y dimensions reaches one (X-dir = 1 and Y-dir = 1), there are six different possible positions of faults (two routers and four links). According to RR-2D, the east and north directions have the same priority, as there are no alternative choices from the next router to the destination router. However, according to the fault distribution mechanism (Fig. 9.5c), the current router knows about the faulty links in the NE path and also the N neighboring router, while it does not know the fault status of the whole EN path. Therefore, the availability of the NE path and the N router is checked, and if they are non-faulty the packet is sent to the north direction; otherwise the packet is sent to the east direction. In the patterns A1, A2, A3, and A6 of Fig. 9.6a, the NE path and the N router are non-faulty and the packet is sent to the north direction. In the patterns A4, A5, and A7, the packet is delivered to the east direction, as either the NE path or the N router is faulty.
In Fig. 9.6b, when X-dir = 1 and Y-dir = 2, a fault might occur in seven different
locations of links and four locations of routers. In all patterns, the availabilities of
the N link and the N router are examined before those of the east direction. In the
patterns B1, B2, B3, B4, B5, B6, B7, B9, B10, and B11, the packet is sent to the
north direction as both the link and router are non-faulty, while in the next hop, one of the patterns of Fig. 9.6a arises (i.e., patterns A1 to A7). In the patterns B8 and B12 of Fig. 9.6b, the packet has to be routed to the east direction to reach the destination router. In Fig. 9.6c, when X-dir = 2 and Y-dir = 1, the availability of the E link and the E router should be checked before the N link and the N router. Therefore, in the patterns C1, C2, C3, C4, C5, C6, C7, C9, C10, and C11, the packet is sent to the east direction as both the E link and the E router are non-faulty. In the patterns C8 and C12, the packet is delivered to the north direction so that the fault is bypassed. In all the other cases (when X-dir ≥ 2 and Y-dir ≥ 2 in Fig. 9.6d), the packet is sent to a non-faulty direction. In the next hop, the patterns are similar to Fig. 9.6b, c. Based on this discussion, to support all single
Fig. 9.7 RR-2D for northeast, northwest, southeast, and southwest packets
faults, only the shortest paths are used. The same perspective can be applied to northwest-, southeast-, and southwestward packets. Figure 9.7 shows the pseudo code of the RR-2D routing algorithm covering all these positions.
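The pseudo code of Fig. 9.7 is not reproduced here, but its northeastward case can be sketched as follows. The Faults record is a hypothetical stand-in for the status bits delivered by the wiring of Fig. 9.5; the priority of checks follows the discussion above.

    from collections import namedtuple

    # Hypothetical fault-status record: the N/E instant links and routers,
    # plus the NE path (the N router's east link), per Fig. 9.5.
    Faults = namedtuple("Faults", "n_link n_router ne_path e_link e_router")

    def rr2d_northeast(dx, dy, faults):
        """Northeastward case of RR-2D (a sketch, not the authors' exact
        pseudo code).  dx, dy > 0 are the remaining hops east and north."""
        if dx == 1 and dy == 1:
            # Fig. 9.6a: north is tried first because the NE path is known.
            if not (faults.n_link or faults.n_router or faults.ne_path):
                return "N"
            return "E"
        if dx == 1:
            # Fig. 9.6b: keep the greater Y distance from reaching zero.
            if not (faults.n_link or faults.n_router):
                return "N"
            return "E"
        if dy == 1:
            # Fig. 9.6c: symmetric case, keep the X distance above one.
            if not (faults.e_link or faults.e_router):
                return "E"
            return "N"
        # Fig. 9.6d: full adaptivity; take any non-faulty minimal direction.
        return "E" if not (faults.e_link or faults.e_router) else "N"

    ok = Faults(False, False, False, False, False)
    assert rr2d_northeast(1, 1, ok) == "N"
    assert rr2d_northeast(1, 1, ok._replace(ne_path=True)) == "E"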
When a packet is east-, west-, north-, or south-bound and there is a faulty link or router in its path, the packet must be rerouted through a non-minimal path around the fault. As illustrated in Fig. 9.8a, for an eastward packet, the east link and router are checked first, and if they are non-faulty, the packet is sent in this direction. However, if either the link or the router is faulty, the packet is delivered to the north or south direction according to the congestion values. Westward packets behave the same way (Fig. 9.8b). If a northward packet faces a fault in the north link
Fig. 9.8 Bypassing faults when the destination is located in the (a) east, (b) west, (c) north,
(d) south positions of the source router (Note that numbers determine the priority of selecting
among different routes)
Fig. 9.9 Tolerating a fault by (a) eastward packets (b) westward packets
or router (Fig. 9.8c), the west direction is checked before the east direction. This means that rerouting through the east direction is done only when the fault is located on the left borderline. A similar perspective is applied to southward packets (Fig. 9.8d).
Now, we need to show that all the turns required for bypassing faults are in the set of allowable turns. By investigating the required turns, we notice that eastward packets use the E-N1, N1-E, E-S1, and S1-E turns to bypass a faulty link or router (Fig. 9.9a), all of which are in the set of allowable turns. Similarly, as shown in Fig. 9.9b, all the turns required by westward packets are permitted (i.e., W-N2, N2-W, W-S2, and S2-W).
We should also prove that northward and southward packets are routed in the network without creating deadlock. As illustrated in Fig. 9.10a, b, the northward and southward packets normally use the permitted turns N2-W, W-N2, N2-E, S2-W, W-S2, and S2-E to bypass faults. However, when the source and destination routers are located on the left borderline and there is a faulty link or router in the path, the required turns are N2-E, E-N2, N2-W, S2-E, E-S2, and S2-W. Among them, the E-N2 and E-S2 turns are not allowed according to our turn model (Table 9.1), but a complete cycle cannot be formed in borderline cases (as indicated in [19]), so these otherwise-prohibited turns can be taken safely. The remaining part of the RR-2D routing algorithm, tolerating faulty links and routers for east-, west-, north-, and southbound packets, is shown in Fig. 9.11 (i.e., the continuation of Fig. 9.7).
Fig. 9.11 RR-2D for eastward, westward, northward, and southward packets
In this section, we present a Reliable Routing algorithm for 3D stacked meshes, called RR-3D. We investigate this method for tolerating faulty routers and links. First, we present a fully adaptive routing algorithm in a 3D mesh network using two, two, and four virtual channels. Then we show how packets can switch between virtual channels to achieve better fault-tolerance capability. We explain the fault monitoring and management technique, and finally we investigate the proposed algorithm for all positions of a single faulty router and link.
In Fig. 9.14b, by sending the packet to one of these directions, the adaptivity will be limited to one path, while by delivering the packet to the other direction there will be at least two shortest paths to reach the destination. Thereby, it is better to select the path which maintains the adaptivity of packets. In Fig. 9.14c, when the destination is at router 4, 10, or 12, there are again two options to forward the packet, but by selecting either option the adaptivity will be lost, and thus there is no preference between them. Finally, in Fig. 9.14d, packets have no alternative paths to reach the destination router 1, 2, 3, 6, 9, or 18.
The fault information of the links and routers shown in Fig. 9.15 must be available at each given router (i.e., router C). The links include the instant links connecting the given router to each neighboring router (i.e., links 3, 8, 9, 10, 11, and 16). The information about these instant links is already available at the given router through a fault-detection technique. However, the fault information of the other links must be sent to the given router. These links are: four links connected to the north neighboring router (i.e., 2, 6, 7, and 15), four links connected to the south neighboring router (i.e., 4, 12, 13, and 17), two links connected to the east neighboring router (i.e., 1 and 14), and two links connected to the west neighboring router (i.e., 5 and 18). Moreover, using a fault-detection technique, a router is informed about its own fault status, while the information about the neighboring routers (i.e., north, south, east, west, up, and down) must be transferred to the given router. The fault management mechanism is responsible for combining the fault information at each router and transferring it to the neighboring routers. In this example, each of the north and south neighboring routers combines the fault information of the router itself and its four connected links, and transfers 5 bits of information to the given router. The east and west neighboring routers transfer 3 bits to the given router: 1 bit for the fault status of the router and 2 bits for the fault statuses of the links. From the up and down directions, 1 bit of information is transferred to the given router, indicating the fault status of the connected router.
Now, let us investigate which information should be transferred from the current router to each of its neighboring routers. The fault status of router C and its four links 3, 9, 10, and 16 is transferred to the north and south neighboring routers. The fault status of router C and its two links 3 and 16 is transferred to the east and west neighboring routers. Finally, only the fault status of router C is sent to the up and down neighboring routers.
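A small sketch of this aggregation, using the link numbers of Fig. 9.15; the dictionary layout and function name are illustrative assumptions.

    def outgoing_fault_info(router_faulty, link_faulty):
        """Fault bits that router C sends to each neighbor (a sketch;
        link_faulty maps the link numbers of Fig. 9.15 to booleans).
        C sends 5 bits north/south (its own status plus links 3, 9, 10,
        16), 3 bits east/west (its status plus links 3 and 16), and
        1 bit up/down (its status only)."""
        ns = [router_faulty] + [link_faulty[i] for i in (3, 9, 10, 16)]
        ew = [router_faulty] + [link_faulty[i] for i in (3, 16)]
        ud = [router_faulty]
        return {"N": ns, "S": ns, "E": ew, "W": ew, "U": ud, "D": ud}

    info = outgoing_fault_info(False, {3: False, 9: True, 10: False, 16: False})
    assert len(info["N"]) == 5 and len(info["E"]) == 3 and len(info["U"]) == 1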
In this section, we show how faulty links can be tolerated using RR-3D. When
packets have at least two minimal choices (similar to Fig. 9.14a), they are sent
Fig. 9.16 RR-3D when the distance reaches one along two directions
to a direction with a non-faulty link and router. When the distance reaches one along two directions, it is important to send the packet through a non-faulty path, as a wrong decision may result in dropping the packet or taking non-minimal routes. In Fig. 9.16a, the packet can be sent to the north or up direction. If only the neighboring links and routers are checked, the packet might be sent to the up direction. However, if the link between router 9 and the destination router 12 is faulty, a non-minimal path would have to be taken. To avoid this situation, the fault status of the NU path and the N router is checked first (these are available through the management mechanism). If they are non-faulty, the packet is sent to the north direction; otherwise the up direction is selected. Similarly, in Fig. 9.16b, the statuses of the NE path and the N router are checked and, if they are non-faulty, the packet is sent to the north direction; otherwise the east direction is selected. Finally, in Fig. 9.16c, the packet is sent to the east direction if the EU path and the E router are non-faulty; otherwise the up direction is selected.
When the current and destination routers are located in the same dimension and a link or router between them is faulty, a non-minimal route is required. In order to avoid taking unnecessarily long paths, RR-3D always tries not to reduce the distance to zero along two dimensions while the distance along the third dimension is greater than one. For example, in Fig. 9.16c, if the source and destination are located at routers 0 and 19, respectively, the availability of the up direction is checked before the east direction. The reason is that if the packet is sent to the east direction and the link 10–19 is faulty, the packet has to take a non-minimal path. On the other hand, by sending the packet to router 9, the packet has two alternative routes to reach the destination, and if one of the routes is faulty, the packet is sent through the other one.
As already mentioned, when packets are east-, west-, north-, or south-bound and there is a fault in the path, they have to take a non-minimal path. RR-3D should be able to tolerate these faults as well. (It is worth mentioning that if the source and destination routers are not located in the same dimension, packets never face these conditions, as faults are bypassed before reaching them.) We take advantage of the capability of switching between subnetworks: when the source and destination routers are located in the same dimension, packets start routing in
Fig. 9.17 RR-3D when there is only a single path between the source and destination routers
the lowest possible subnetwork and can then switch to a higher subnetwork, in ascending order, if needed. The rules of the RR-3D algorithm are as follows:
• Rule 1: If a fault occurs on the Z dimension, packets are rerouted through the Y dimension.
• Rule 2: If a fault occurs on the X or Y dimension, packets are rerouted through the Z dimension.
Applying these rules, packets may need to change subnetworks. Let us consider the example of Fig. 9.17a, where the source and destination are located at routers 4 and 22, respectively, and router 15 is faulty. Since the fault has occurred along the Z dimension, according to Rule 1 packets are rerouted through the Y dimension, in either its positive or negative direction. Let us assume that rerouting takes place through the positive direction of the Y dimension. Since subnetwork 1 covers Y1+ (shown in Fig. 9.13), the packet uses this channel. Then the packet should be routed along the positive direction of the Z dimension. Subnetwork 1 also covers Z1+, so the packet is still routed using the channels of subnetwork 1. The packet continues along this direction until it reaches the same layer as the destination router, where it should be sent through the negative direction of the Y dimension. This channel is not included in subnetwork 1, but subnetwork 2 covers Y1–, so the packet uses this channel to reach the destination router. Now, we explain the situation when the packet is sent along the negative direction of the Y dimension at router 4. Subnetwork 1 does not cover the negative direction of the Y dimension, while subnetwork 2 covers it, so the packet uses Y1– from this subnetwork. The packet can be routed along the Z dimension using Z2+ from the same subnetwork. Finally, the positive direction of the Y dimension is not covered by subnetwork 2, so Y2+ is taken from subnetwork 3 and the packet reaches the destination. As every router has at least one neighbor along the Y dimension, faults on vertical connections can be tolerated by rerouting packets through the Y dimension (see two more examples in Fig. 9.17a).
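The subnetwork bookkeeping in this example can be sketched as follows. The coverage table lists only the channels mentioned in the worked examples (the complete assignment is in Fig. 9.13, not reproduced here), so it is an assumption for illustration.

    # Which virtual channels each subnetwork covers, as used in the worked
    # examples above (a partial, assumed table; the full one is Fig. 9.13).
    COVERS = {
        1: {"Y+": "Y1+", "Z+": "Z1+"},
        2: {"Y-": "Y1-", "Z+": "Z2+", "Z-": "Z2-"},
        3: {"Y+": "Y2+", "X-": "X1-", "Z+": "Z3+", "Z-": "Z3-"},
    }

    def next_channel(direction, current_subnet):
        """Return (subnetwork, channel) for the next hop: stay in the
        current subnetwork if it covers the direction, otherwise switch
        upward in ascending order (switching only upward is what keeps
        the scheme deadlock-free)."""
        for s in sorted(COVERS):
            if s >= current_subnet and direction in COVERS[s]:
                return s, COVERS[s][direction]
        raise ValueError("no subnetwork covers " + direction)

    # The router-4-to-22 example: Y+ then Z+ stay in subnetwork 1, and the
    # final Y- hop forces a switch to subnetwork 2.
    assert next_channel("Y+", 1) == (1, "Y1+")
    assert next_channel("Z+", 1) == (1, "Z1+")
    assert next_channel("Y-", 1) == (2, "Y1-")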
Figure 9.17b shows cases where the fault occurs on Y links and is tolerated by rerouting packets through the Z dimension according to Rule 2. We investigate the example where the source and destination are located at routers 15 and 9, respectively, and the link 12–9 is faulty. First, the negative direction of the Y dimension should be taken. Since subnetwork 1 does not cover it, the next subnetwork is checked. Subnetwork 2 covers Y1–, and thus the packet uses this channel. For routing along the Z dimension, in either the positive or negative direction, the channels of subnetwork 2 are used (Z2*). Then the packet should be routed along the negative direction of the Y dimension, which is covered by subnetwork 2 (Y1−). Finally, the negative and positive directions of the Z dimension are covered by the same subnetwork (Z2*).
In Fig. 9.17c, according to Rule 2, faults in the X dimension are tolerated by rerouting packets through the Z dimension. Let us assume the case where the source and destination are located at routers 11 and 9, respectively, and the link 10–9 is faulty. The negative direction of the X dimension is not covered by subnetworks 1 and 2, and thus X1− from subnetwork 3 is used. The packet is rerouted along the Z dimension using Z3+ or Z3− from the same subnetwork. The packet then needs to take X1− and then Z3+ or Z3− to reach the destination router; all of these channels are covered by subnetwork 3. As every router has at least one neighbor in the Z dimension, faults on the X or Y dimension can be tolerated by rerouting packets along the Z dimension (see additional examples in Fig. 9.17b, c).
The injection rate is defined as the ratio of successful packet injections into the network to the total number of injection attempts. As a performance metric, we use latency, defined as the number of cycles between the initiation of a packet issued by a processing element and the time when the packet is completely delivered to the destination.
For evaluating the performance in a 2D mesh network, RR-2D is compared with the reconfigurable routing algorithm presented in [19] (in our simulations we call this method ReRS). As discussed in the related work, ReRS does not require any virtual channels and is able to tolerate all locations of a single faulty router. However, with ReRS, unnecessarily long paths are taken to tolerate faults, which creates congestion around the faulty regions. In addition, ReRS is based on deterministic routing, so packets cannot be well distributed over the network. RR-2D is our proposed method, which is based on fully adaptive routing and utilizes one and two virtual channels along the X and Y dimensions, respectively. RR-2D is able to tolerate all locations of a single faulty link or router. For a fair comparison, we use the same number of virtual channels in both methods; for this purpose, an extra virtual channel is added to the ReRS approach. This virtual channel is used for performance purposes.
For measuring the performance in a 3D mesh network, the proposed method (RR-3D) is compared with HamFA [27]. HamFA is a fault-tolerant method tolerating almost all single unidirectional faulty links. It is able to tolerate faults on either vertical or horizontal links without using virtual channels. HamFA is a partially adaptive routing algorithm. RR-3D, on the other hand, can tolerate both faulty links and routers while guaranteeing tolerance of all single faults wherever they occur in the network. RR-3D is built upon a fully adaptive routing algorithm requiring two, two, and four virtual channels. For a fair comparison, the same number of virtual channels is used in both methods.
We perform two sets of simulations: (1) measuring the performance of the proposed methods against the baseline methods in both 2D and 3D mesh networks, and (2) measuring the reliability of the proposed methods compared with the baseline methods. In both sets of simulations, we perform the experiments on an 8 × 8 and a 4 × 4 × 4 mesh network. For the performance analysis, the simulator is warmed up for 20,000 cycles and then the average performance is measured over another 200,000 cycles.
In the uniform traffic profile, each processing element generates data packets and sends them to the other processing elements using a uniform distribution [29]. In Fig. 9.18a, the average communication latencies of RR-2D and ReRS are measured for the fault-free and single-faulty-router cases. In addition, the performance of RR-2D is measured with six faults in the network, each randomly chosen to be a faulty link or router. As can be observed from the results, in the fault-free case the ReRS method performs best, as it is based on deterministic routing (i.e., similar to XY
Fig. 9.18 Performance under uniform traffic profile in (a) 2D network, (b) 3D network
routing), which is well suited to uniform traffic. When a single fault occurs in the network, the performance of ReRS decreases considerably, while RR-2D maintains its performance in the presence of faults, even with six faults in the network. The reason is that RR-2D not only avoids taking unnecessary non-minimal routes, but is also based on fully adaptive routing, and thus distributes packets well over different routes.
In Fig. 9.18b, the average communication delay of the RR-3D and HamFA schemes is plotted. RR-3D outperforms HamFA under uniform traffic. For reasons similar to those for RR-2D, the network performance is maintained in the presence of faults in the network.
Under the hotspot traffic pattern, one or more routers are chosen as hotspots, receiving an extra portion of traffic in addition to the regular uniform traffic. Given a hotspot percentage H, a newly generated packet is directed to each hotspot router with an additional probability of H percent. We simulate hotspot traffic with H = 10% using a single hotspot router at (4,4) in the 8 × 8 network and two hotspot routers at positions (2,1,2) and (3,2,2) in the 4 × 4 × 4 network.
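As a sketch of this traffic generator (the simulator's actual implementation is not shown in the chapter), a destination can be drawn as follows:

    import random

    def pick_destination(nodes, hotspots, h=0.10):
        """Hotspot traffic: with probability h per hotspot, a new packet is
        directed to that hotspot; otherwise the destination is uniform."""
        for hs in hotspots:
            if random.random() < h:
                return hs
        return random.choice(nodes)

    nodes = [(x, y) for x in range(8) for y in range(8)]
    dests = [pick_destination(nodes, [(4, 4)]) for _ in range(10_000)]
    # (4, 4) receives roughly h extra traffic on top of its uniform share.
    print(dests.count((4, 4)) / len(dests))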
In Fig. 9.19a, the performance of RR-2D and ReRS is measured for fault-free and
Fig. 9.19 Performance under hotspot traffic profile in (a) 2D network, (b) 3D network
single-faulty-router cases, while the performance of RR-2D is shown with six faults as well (a mixture of faulty routers and links). RR-2D performs considerably better than ReRS even in the presence of six faults. This is due to the fact that RR-2D is an adaptive method and can balance the traffic over the network; moreover, RR-2D avoids non-minimal routes as much as possible. The performance of the RR-3D and HamFA methods is illustrated in Fig. 9.19b. As can be observed from this figure, RR-3D achieves the best performance in the fault-free case. When there are six faults in the network, the performance remains high, which means that RR-3D tolerates faults well while maintaining performance.
As ReRS is able to tolerate only faulty routers in its basic form, we measure its reliability by disabling routers only. RR-2D can tolerate both faulty routers and links, so its reliability is measured under a mixture of them. We increase the number of faults from 1 to 6. All faults are selected using a random function. A network is considered reliable if all the injected packets reach their destinations. In other words, the
Fig. 9.20 Reliability of (a) RR-2D and ReRS, (b) RR-3D and HamFA, under one to six faults
network is counted as unreliable even if all packets except one reach their destinations. As shown in Fig. 9.20a, RR-2D can tolerate up to six faults with more than 40% reliability.
For the 3D network, we compare the reliability of RR-3D with that of HamFA. As HamFA is designed to tolerate unidirectional faulty links, its reliability is also measured based on this kind of fault. RR-3D, on the other hand, tolerates both faulty links and routers, and its reliability value is obtained in the presence of both kinds of faults. We inject 1 to 6 faults into the network (a mixture of faulty routers and links) to measure the reliability of RR-3D, while 1 to 6 unidirectional faulty links are injected to measure the reliability of HamFA. As shown in Fig. 9.20b, the reliability of HamFA decreases significantly in comparison with RR-3D. Using RR-3D, with a probability of 18% the network operates normally without any packet loss even when there are six faults in the network.
The RR-2D and ReRS routers are synthesized at an operating frequency of 500 MHz and a supply voltage of 1 V. We perform place-and-route using Cadence Encounter to obtain precise power and area estimations. The power dissipation is calculated using Synopsys PrimePower in an 8 × 8 mesh network. According to our analysis, the area overheads of RR-2D and ReRS are almost the same, as both use a simple routing unit and a similar number of wires to distribute the fault information. The power consumption of ReRS is slightly larger than that of RR-2D, since RR-2D takes advantage of fully adaptive routing and also uses the shortest paths whenever possible.
The whole platforms of the HamFA and RR-3D methods are also synthesized with Synopsys Design Compiler. The same numbers of channels are used in both methods, and two faulty links are injected into the network. Depending on the technology and manufacturing process, the pitch of TSVs can range from 1 to 10 μm. In this work, the pad size for TSVs is assumed to be 5 μm square with a pitch of around 8 μm. The layout areas of the HamFA and RR-3D schemes are almost the same, with the area overhead of RR-3D slightly larger than that of HamFA. This small difference is due to the monitoring and management technique, which does not exist in the HamFA method. The average power consumption of the RR-3D scheme is 8% less than that of the HamFA scheme, as RR-3D is able to balance the traffic over the network and deliver packets to destinations through the shortest paths whenever possible.
10 Bufferless and Minimally-Buffered Deflection Routing
C. Fallin et al.
10.1 Introduction
1 One recent estimate indicates that static power (of buffers and links) could constitute 80–90 % of
interconnect power in future systems [7].
2 In a conventional bufferless deflection network, flits (several of which make up one packet) are
independently routed, unlike most buffered networks, where a packet is the smallest independently-
routed unit of traffic.
10.2 Background
Conventional buffered NoC routers buffer every flit that enters the router from an input port before the flits can arbitrate for output ports. Dally and Towles [12] provide a good reference on these routers.
NoCs in cache-coherent CMPs: On-chip networks form the backbone of memory
systems in many recently-proposed and prototyped large-scale CMPs (chip mul-
tiprocessors) [28, 45, 50]. Most such systems are cache-coherent shared memory
multiprocessors. Packet-switched interconnect has served as the substrate for large
cache-coherent systems for some time (e.g., for large multiprocessor systems such
as SGI Origin [34]), and the principles are the same in a chip multiprocessor: each
core, slice of a shared cache, or memory controller is part of one “node” in the
network, and network nodes exchange packets that request and respond with data
in order to fulfill memory accesses. A diagram of a typical system is shown in
Fig. 10.1. For example, on a miss, a core’s private cache might send a request packet
to a shared L2 cache slice, and the shared cache might respond with a larger packet
containing the requested cache block on an L2 hit, or might send another packet to a
memory controller on an L2 miss. CMP NoCs are typically used to implement such
a protocol between the cores, caches and memory controllers.
Bufferless deflection routing has found renewed interest in NoC design because
on-chip wires (hence, network links) are relatively cheap, in contrast to buffers,
which consume significant die area and leakage power [4, 5, 7, 29, 38]. Several
evaluations of bufferless NoC design [17, 26, 29, 38] have demonstrated that
removing the buffers in NoC routers, and implementing a routing strategy which
operates without the need for buffers (such as the one we describe below), yield
energy-efficiency improvements because occasional extra link traversals due to
deflections consume relatively less energy than the dynamic energy used to buffer
traffic at every network hop and the static energy consumed whenever a buffer is
turned on. (Our motivational experiments in Sect. 10.3 demonstrate the performance
and energy impact of such a network design in more detail.) Although other
solutions exist to reduce the energy consumption of buffers, such as dynamic buffer
bypassing [37, 49] (which we also incorporate into our baseline buffered-router
design in this chapter), bufferless deflection routing achieves additional savings in
energy and area by completely eliminating the buffers.
One recent work proposed BLESS [38], a router design that implements bufferless deflection routing, which we describe here. The fundamental unit of routing in a BLESS network is the flit, a packet fragment transferred by one link in one cycle. Flits are routed independently in BLESS. Because flits are routed independently, they must be reassembled after they are received. BLESS assumes the existence of sufficiently-sized reassembly buffers at each node in order to reconstruct arriving flits into packets. (Later work, CHIPPER [17], addresses the reassembly problem explicitly, as we discuss below.)
Deflection Routing Arbitration: The basic operation of a BLESS bufferless
deflection router is simple. In each cycle, flits arriving from neighbor routers enter
the router pipeline. Because the router contains no buffers, flits are stored only in
pipeline registers, and must leave the router at the end of the pipeline. Thus, the
router must assign every input flit to some output port. When two flits request the
same output port according to their ordinary routing function, the router deflects one
of them to another port (this is always possible, as long as the router has as many
outputs as inputs). BLESS performs this router output port assignment in two stages:
flit ranking and port selection [38]. In each cycle, the flits that arrive at the router
are first ranked in a priority order (chosen in order to ensure livelock-free operation,
as we describe below). At the same time, the router computes a list of productive
output ports (i.e., ports which would send the flit closer to its destination) for each
flit. Once the flit ranking and each flit's productive output ports are available, the
router assigns a port to each flit, starting from the highest-ranked flit and assigning
ports to flits one at a time. Each flit obtains a productive output port if one is still
available, and is “deflected” to any available output port otherwise. Because there
are as many output ports as input ports, and only the flits arriving on the input ports
in a given cycle are considered, this process never runs out of output ports and can
always assign each flit to some output. Hence, no buffering is needed, because every
flit is able to leave the router at the end of the router pipeline.
Livelock freedom in BLESS: Although a BLESS router ensures that a flit is
always able to take a network hop to some other router, a deflection takes a flit
further from its destination, and such a flit will have to work its way eventually
to its destination. In such a network design, explicit care must be taken to ensure
that all flits eventually arrive at their destinations (i.e., that no flit circles, or gets
stuck, in the network forever). This property is called livelock freedom. Note that
conventional virtual channel-buffered routers, which buffer flits at every network
hop, are livelock-free simply because they never deflect flits: rather, whenever a flit
leaves a router and traverses a link, it always moves closer toward its destination
(this is known as minimal routing [12]).
BLESS ensures livelock freedom by employing a priority scheme called Oldest-
First [38]. Oldest-First prioritization is a total order over all flits based on each flit’s
age (time it has spent in the network). If two flits have the same age (entered the
network in the same cycle), then the tie is broken with other header fields (such as
sender ID) which uniquely identify the flit. This total priority order leads to livelock-
free operation in a simple way: there must be one flit which is the oldest, and thus
has the highest priority. This flit is always prioritized during flit-ranking at every
router it visits. Thus, it obtains its first choice of output port and is never deflected.
Because it is never deflected, the flit always moves closer toward its destination, and
will eventually arrive. Once it arrives, it is no longer contending with other flits in
the network, and some other flit is the oldest flit. The new oldest flit is guaranteed
to arrive likewise. Inductively, all flits eventually arrive.
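A compact sketch of one cycle of BLESS arbitration, combining Oldest-First ranking with sequential port assignment; the flit representation and field names are our own simplifications, not the BLESS implementation.

    def bless_assign(flits, free_ports):
        """One cycle of BLESS port assignment (a sketch).  Each flit is a
        dict with 'age', 'sender', and 'productive' (its productive output
        ports).  Flits are ranked Oldest-First -- age, then sender ID to
        break ties -- and assigned ports sequentially, highest rank first."""
        assignment = {}
        ranked = sorted(flits, key=lambda f: (-f["age"], f["sender"]))
        for f in ranked:
            # Take a productive port if one is still free ...
            for p in f["productive"]:
                if p in free_ports:
                    break
            else:
                p = next(iter(free_ports))   # ... otherwise deflect anywhere
            free_ports.remove(p)
            assignment[f["sender"]] = p
        return assignment

    flits = [{"age": 9, "sender": 0, "productive": ["N"]},
             {"age": 4, "sender": 1, "productive": ["N", "E"]}]
    print(bless_assign(flits, {"N", "E", "S", "W"}))  # the oldest flit gets N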
Flit injection and ejection: A BLESS router must inject new flits into the network
when a node generates a packet, and it must remove a flit from the network when
the flit arrives at its destination. A BLESS router makes a local decision to inject a
flit whenever, in a given cycle, there is an empty slot on any of its input ports [38].
The router has an injection queue where flits wait until this injection condition is
met. When a node is not able to inject, it is starved; injection starvation is a useful proxy for network congestion and has been used to drive congestion-control mechanisms in bufferless deflection networks [8, 39, 40].
When a flit arrives at its destination router, that router removes the flit from the
network and places it in a reassembly buffer, where it waits for the other flits from its
packet to arrive. Flits in a packet may arrive in any order because each flit is routed
independently, and might take a different path than the others due to deflections.
Once all flits in a packet have arrived in that packet’s reassembly buffer, the packet
is delivered to the local node (e.g., core, cache, or memory controller). A BLESS
router can eject up to one flit per cycle from its inputs to its reassembly buffer.
Figure 10.2 depicts the reassembly buffers as well as the injection queue of a node
in a BLESS NoC.
Fig. 10.2 Reassembly buffers and injection queue of a node in a BLESS NoC
CHIPPER [17], another bufferless deflection router design, was proposed to address
implementation complexities in prior bufferless deflection routers (e.g., BLESS).
The CHIPPER router has smaller and simpler deflection-routing logic than BLESS,
which leads to a shorter critical path, smaller die area and lower power.
The Oldest-First age-based arbitration in BLESS leads to slow routers with large
hardware footprint [17, 26, 37] for several reasons, which we describe here.
Deflection arbitration: First, implementing deflection arbitration in the way
that BLESS specifies leads to complex hardware. Routers that use Oldest-First
arbitration must sort input flits by priority (i.e., age) in every cycle. This requires
a three-stage sorting network for four inputs. Then, the router must perform port
assignment in priority order, giving higher-priority flits first choice. Because a lower-priority flit might be deflected if a higher-priority flit takes an output port first, flits must be assigned output ports sequentially. This sequential port allocation
leads to a long critical path, hindering practical implementation. This critical path
(through priority sort and sequential port allocation) is illustrated in Fig. 10.3.
Packet reassembly: Second, as noted above, BLESS makes use of reassembly
buffers to reassemble flits into packets. Reassembly buffers are necessary because
each flit is routed independently and may take a different path than the others in a
packet, arriving at a different time. Moscibroda and Mutlu [38] evaluate bufferless
deflection routing assuming a large enough reassembly buffer, and report maximum
buffer occupancy.
However, with a limited reassembly buffer smaller than a certain size, deadlock will occur in the worst case (when all nodes send a packet simultaneously
to a single node). To see why this is the case, observe the example in Fig. 10.4
(figure taken from Fallin et al. [17]). When a flit arrives at the reassembly buffers in
Node 0, the packet reassembly logic checks whether a reassembly slot has already
been allocated to the packet to which this flit belongs. If not, a new slot is allocated,
if available. If the packet already has a slot, the flit is placed into its proper location
within the packet-sized buffer. When no slots are available and a flit from a new
packet arrives, the reassembly logic must prevent the flit from being ejected out of
the network. In the worst case, portions of many separate packets arrive at Node 0,
allocating all its slots. Then, flits from other packets arrive, but cannot be ejected,
because no reassembly slots are free. These flits remain in the network, deflecting
and retrying ejection. Eventually, the network will fill with these flits. The flits which
are required to complete the partially-reassembled packets may have not yet been
injected at their respective sources, and they cannot be injected, because the network
is completely full. Thus, deadlock occurs. Without a different buffer management
scheme, the only way to avoid this deadlock is to size the reassembly buffer at each
node for the worst case when all other nodes in the system send a packet to that node
simultaneously. A bufferless deflection router implementation with this amount of
buffering would have significant overhead, unnecessarily wasting area and power.
Hence, an explicit solution is needed to ensure deadlock-free packet reassembly in
practical designs.
4 CHIPPER assumes that all routers are in a single clock domain, hence can maintain synchronized
golden packet IDs simply by counting clock ticks.
Fig. 10.5 CHIPPER router microarchitecture: router pipeline (left) and detail of a single arbiter
block (right)
The most important consequence of Golden Packet is that each router only needs
to correctly route the highest-priority flit. This is sufficient to ensure that the first
outstanding flit of the Golden Packet is delivered within L cycles. Because the
packet will periodically become Golden until delivered, all of its flits are guaranteed
delivery.
Because Golden Packet prioritization provides livelock freedom as long as the
highest-priority flit is correctly routed, the deflection routing (arbitration) logic does
not need to sequentially assign each flit to the best possible port, as the BLESS
router’s deflection routing logic does (Fig. 10.3). Rather, it only needs to recognize a
golden flit, if one is present at the router inputs, and route that flit correctly if present.
All other deflection arbitration is best-effort. Arbitration can thus be performed more
quickly with simpler logic.
We now describe the CHIPPER router's arbitration logic; the router's
pipeline is depicted in Fig. 10.5 (see Fallin et al. [17] for more details, including
the ejection/injection logic which is not described here). The CHIPPER router’s
arbitration logic is built with a basic unit, the two-input arbiter block, shown on the
right side of Fig. 10.5. Each two-input arbiter block receives up to two flits every
cycle and routes these two flits to its outputs. In order to route its input flits, the
two-input arbiter block chooses one winning flit. If a golden flit is present, the
golden flit is the winning flit (if two golden flits are present, the tie is broken as
described by the prioritization rules). If no golden flit is present, one of the input
flits is chosen randomly to be the winning flit. The two-input arbiter block then
examines the winning flit’s destination, and sends this flit toward the arbiter block’s
output which leads that flit closer to its destination. The other flit, if present, must
then take the remaining arbiter block output.
10 Bufferless and Minimally-Buffered Deflection Routing 251
The CHIPPER router performs deflection arbitration among four input flits (from
the four inputs in a 2D mesh router) using a permutation network of four arbiter
blocks, connected in two stages of two blocks each, as shown in the permute pipeline
stage of Fig. 10.5. The permutation network allows a flit from any router input
to reach any router output. When flits arrive, they arbitrate in the first stage, and
winning flits are sent toward the second-stage arbiter block which is connected to
that flit’s requested router output. Then, in the second stage, flits arbitrate again.
As flits leave the second stage, they proceed directly to the router outputs via a
pipeline register (no crossbar is necessary, unlike in conventional router designs).
This two-stage arbitration has a shorter critical path than the sequential scheme
used by a BLESS router because the arbiter blocks in each stage work in parallel,
and because (unlike in a BLESS arbiter) the flits need not be sorted by priority first.
The arbiter-block permutation network cannot perform all possible flit permutations
(unlike the BLESS router’s routing logic), but because a golden flit (if present) is
always prioritized, and hence always sent to a router output which carries the flit
closer to its destination, the network is still livelock-free. Because the permutation
network (i) eliminates priority sorting, and (ii) partially parallelizes port assignment,
the router critical path is improved (reduced) by 29.1 %, performing within 1.1 % of
a conventional buffered router design [17].
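The arbiter-block structure can be sketched as follows. This is a simplification, not the CHIPPER RTL: flits are dicts, the routing function is reduced to a single wants_0 preference bit, and golden ties are not broken by sequence number here.

    import random

    def arbiter_block(f0, f1):
        """Two-input arbiter block (a sketch).  Flits are dicts with a
        'golden' flag and 'wants_0' (True if output 0 leads toward the
        destination).  The winner -- golden first, otherwise a random
        choice -- is routed toward its preferred output; the loser, if
        any, takes the remaining output."""
        flits = [f for f in (f0, f1) if f is not None]
        if not flits:
            return None, None
        golden = [f for f in flits if f["golden"]]
        # Real CHIPPER breaks golden-golden ties by sequence number.
        winner = golden[0] if golden else random.choice(flits)
        loser = next((f for f in flits if f is not winner), None)
        if winner["wants_0"]:
            return winner, loser
        return loser, winner

    def permute_stage(n, e, s, w):
        """Two stages of two arbiter blocks (the permute stage): any input
        can reach any output, though not every permutation is realizable --
        which is acceptable because only a golden flit must be routed
        correctly."""
        a0, a1 = arbiter_block(n, e)
        b0, b1 = arbiter_block(s, w)
        out_n, out_e = arbiter_block(a0, b0)
        out_s, out_w = arbiter_block(a1, b1)
        return out_n, out_e, out_s, out_w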
Addressing packet reassembly deadlock with Retransmit-Once: Fallin et
al. [17] observe that the reassembly deadlock problem is fundamentally due to
a lack of global flow control. Unlike buffered networks, which can pass tokens
upstream to senders to indicate whether downstream buffer space is available,
a bufferless deflection network has no such backpressure. Allowing receivers to
exert backpressure on senders solves the problem. Thus, CHIPPER introduces a
new low-overhead flow control protocol, Retransmit-Once, as its second major
contribution.
Retransmit-Once opportunistically assumes that buffer space will be available,
imposing no network overhead in the common case. When no space is available, any
subsequent arriving packet is dropped at the receiver. However, the receiver makes
note of this dropped packet. Once reassembly buffer space becomes available, the
reassembly logic in the receiver reserves buffer space for the previously dropped
packet, and the receiver then requests a retransmission from the sender. Thus, at
most one retransmission is necessary for any packet. In addition, by dropping only
short request packets (which can be regenerated from a sender’s request state), and
using reservations to ensure that longer data packets are never dropped, Retransmit-
Once ensures that senders do not have to buffer data for retransmission. In our
evaluations of realistic workloads, retransmission rate is 0.021 % maximum with
16-packet reassembly buffers, hence the performance impact is negligible. Fallin
et al. [17] describe the Retransmit-Once mechanism in more detail and report that it
can be implemented with very little overhead by integrating with cache-miss buffers
(MSHRs) in each node.
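A sketch of the receiver-side bookkeeping that Retransmit-Once implies; the class shape and method names are assumptions, and sender-side request state is omitted.

    class ReassemblyBuffers:
        """Receiver side of Retransmit-Once (a sketch).  Packet-sized slots
        are allocated on a flit's arrival; if none is free, the flit's
        packet is dropped and noted, and a retransmission is requested
        once a slot frees up."""
        def __init__(self, slots):
            self.free, self.slots, self.dropped = slots, {}, set()

        def on_flit(self, pkt_id):
            if pkt_id in self.slots:
                return "eject"              # common case: slot already exists
            if self.free > 0:
                self.free -= 1
                self.slots[pkt_id] = True
                return "eject"
            self.dropped.add(pkt_id)        # note the drop; nothing buffered
            return "drop"

        def on_packet_done(self, pkt_id):
            del self.slots[pkt_id]
            self.free += 1
            if self.dropped:                # reserve a slot for a dropped
                retry = self.dropped.pop()  # packet, then ask for a resend
                self.free -= 1
                self.slots[retry] = True
                return ("retransmit", retry)
            return None

    rb = ReassemblyBuffers(slots=1)
    assert rb.on_flit("p0") == "eject"
    assert rb.on_flit("p1") == "drop"       # no slot free: drop and note
    assert rb.on_packet_done("p0") == ("retransmit", "p1")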
Previous NoC designs based on bufferless deflection routing, such as BLESS [38] and CHIPPER [17], which we just introduced, were motivated largely by the observation that many NoCs in CMPs are over-provisioned for the common-
case network load. In this case, a bufferless network can attain nearly the same
application performance while consuming less power, which yields higher energy
efficiency. We now examine the buffered-bufferless comparison in more detail.
Figure 10.6 shows (i) relative application performance (weighted speedup: see
Sect. 10.5), and (ii) relative energy efficiency (performance per watt), when using a
bufferless network, compared to a conventional buffered network. Both plots show
these effects as a function of network load (average injection rate). Here we show
a virtual channel buffered network (4 VCs, 4 flits/VC) (with buffer bypassing) and
the CHIPPER bufferless deflection network [17] in a 4 × 4-mesh CMP (details on
methodology are in Sect. 10.5).
For low-to-medium network load, a bufferless network has performance close
to a conventional buffered network, because the deflection rate is low: thus, most
flits take productive network hops on every cycle, just as in the buffered network.
In addition, the bufferless router has significantly reduced power (hence improved
energy efficiency), because the buffers in a conventional router consume significant
power. However, as network load increases, the deflection rate in a bufferless
deflection network also rises, because flits contend with each other more frequently.
With a higher deflection rate, the dynamic power of a bufferless deflection network
rises more quickly with load than dynamic power in an equivalent buffered network,
Fig. 10.6 System performance and energy efficiency (performance per watt) of bufferless deflec-
tion routing, relative to conventional input-buffered routing (4 VCs, 4 flits/VC) that employs buffer
bypassing, in a 4 × 4 2D mesh. Injection rate (X axis) for each workload is measured in the baseline
buffered network
because each deflection incurs some extra work. Hence, bufferless deflection
networks lose their energy-efficiency advantage at high load. Just as important, the
high deflection rate causes each flit to take a longer path to its destination, and this
increased latency reduces the network throughput and system performance.
Overall, neither design obtains both good performance and good energy effi-
ciency at all loads. If the system usually experiences low-to-medium network load,
then the bufferless design provides adequate performance with low power (hence
high energy efficiency). But, if we use a conventional buffered design to obtain high
performance, then energy efficiency is poor in the low-load case, and even buffer
bypassing does not remove this overhead because buffers consume static power
regardless of use. Finally, simply switching between these two extremes at a per-
router granularity, as previously proposed [29], does not address the fundamental
inefficiencies in the bufferless routing mode, but rather, uses input buffers for all
incoming flits at a router when load is too high for the bufferless mode (hence
retains the relative energy-inefficiency of buffered operation at high load). We
now introduce MinBD, the minimally-buffered deflection router, which combines
bufferless and buffered routing to reduce this overhead.
In the MinBD router, some flits that would otherwise be deflected are removed from the network temporarily into a small side buffer, and given a second chance to arbitrate for a productive router output when re-injected. This reduces the network's deflection rate (hence improves performance and energy efficiency) while buffering only a fraction of traffic.
We will describe the operation of the MinBD router in stages. First, Sect. 10.4.1
describes the deflection routing logic that computes an initial routing decision for the
flits that arrive in every cycle. Then, Sect. 10.4.2 describes how the router chooses
to buffer some (but not all) flits in the side buffer. Section 10.4.3 describes how
buffered flits and newly-generated flits are injected into the network, and how a flit
that arrives at its destination is ejected. Finally, Sect. 10.4.4 discusses correctness
issues, and describes how MinBD ensures that all flits are eventually delivered.
The MinBD router pipeline is shown in Fig. 10.7. Flits travel through the pipeline
from the inputs (on the left) to outputs (on the right). We first discuss the deflection
routing logic, located in the Permute stage on the right. This logic implements
deflection routing: it sends each input flit to its preferred output when possible,
deflecting to another output otherwise.
MinBD uses the deflection logic organization first proposed in CHIPPER [17].
The permutation network in the Permute stage consists of two-input blocks arranged
into two stages of two blocks each. This arrangement can send a flit on any input
to any output. (Note that it cannot perform all possible permutations of inputs to
outputs, but as we will see, it is sufficient for correct operation that at least one flit
obtains its preferred output.) In each two-input block, arbitration logic determines
which flit has a higher priority, and sends that flit in the direction of its preferred
output. The other flit at the two-input block, if any, must take the block’s other
output. By combining two stages of this arbitration and routing, deflection arises as
a distributed decision: a flit might be deflected in the first stage, or the second stage.
Restricting the arbitration and routing to two-flit subproblems reduces complexity
and allows for a shorter critical path, as demonstrated in [17].
Ruleset 2 MinBD prioritization rules (based on Golden Packet [17] with new rule 3)
Given: two flits, each Golden, Silver, or Ordinary. (Only one can be Silver.)
1. Golden Tie: Ties between two Golden flits are resolved by sequence number (first in Golden
Packet wins).
2. Golden Dominance: If one flit is Golden, it wins over any Silver or Ordinary flits.
3. Silver Dominance: Silver flits win over Ordinary flits.
4. Common Case: Ties between Ordinary flits are resolved randomly.
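Ruleset 2 translates directly into a comparison function; the flit encoding below is an assumed simplification.

    import random

    RANK = {"golden": 2, "silver": 1, "ordinary": 0}

    def wins(a, b):
        """True if flit a beats flit b under Ruleset 2 (a sketch).  A flit
        is a dict with 'level' in {'golden', 'silver', 'ordinary'}; golden
        flits also carry 'seq', their sequence number in the Golden Packet."""
        if a["level"] == b["level"] == "golden":
            return a["seq"] < b["seq"]                  # Rule 1: Golden Tie
        if RANK[a["level"]] != RANK[b["level"]]:
            return RANK[a["level"]] > RANK[b["level"]]  # Rules 2 and 3
        return random.random() < 0.5                    # Rule 4: Common Case

    assert wins({"level": "golden", "seq": 0}, {"level": "silver"})
    assert wins({"level": "silver"}, {"level": "ordinary"})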
In order to ensure correct operation, the router must arbitrate between flits so
that every flit is eventually delivered, despite deflections. MinBD adapts a modified
version of the Golden Packet priority scheme [17], which solves this livelock-
freedom problem. This priority scheme is summarized in Ruleset 2. The basic idea
of the Golden Packet priority scheme is that at any given time, at most one packet
in the system is golden. The flits of this golden packet, called “golden flits,” are
prioritized above all other flits in the system (and contention between golden flits
is resolved by the flit sequence number). While prioritized, golden flits are never
deflected by non-golden flits. The packet is prioritized for a period long enough
to guarantee its delivery. Finally, this “golden” status is assigned to one globally-
unique packet ID (e.g., source node address concatenated with a request ID), and
this assignment rotates through all possible packet IDs such that any packet that is
“stuck” will eventually become golden. In this way, all packets will eventually be
delivered, and the network is livelock-free. (See [17] for the precise way in which
the Golden Packet is determined; MinBD uses the same rotation schedule.)
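Because the rotation schedule is a deterministic function of time, every router can compute the current golden packet ID locally, with no extra communication. A minimal sketch, where the ID-space size and epoch length are illustrative parameters (the precise schedule is defined in [17]):

```python
def golden_packet_id(cycle: int, num_packet_ids: int, epoch_len: int) -> int:
    """The golden ID advances one step per epoch and wraps around, so every
    packet ID (and hence any 'stuck' packet) eventually becomes golden."""
    return (cycle // epoch_len) % num_packet_ids

def is_golden(packet_id: int, cycle: int,
              num_packet_ids: int = 1024, epoch_len: int = 32) -> bool:
    # num_packet_ids and epoch_len are assumed values, not the chapter's.
    return packet_id == golden_packet_id(cycle, num_packet_ids, epoch_len)
```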
However, although Golden Packet arbitration provides correctness, a perfor-
mance issue occurs with this priority scheme. Consider that most flits are not golden:
the elevated priority status provides worst-case correctness, but does not impact
common-case performance (prior work reported over 99 % of flits are delivered
without becoming golden [17]). However, when no flits are golden and ties are
broken randomly, the arbitration decisions in the two permutation network stages
are not coordinated. Hence, a flit might win arbitration in the first stage, and cause
another flit to be deflected, but then lose arbitration in the second stage, and also be
deflected. Thus, unnecessary deflections occur when the two permutation network
stages are uncoordinated.
In order to resolve this performance issue, we observe that it is enough to ensure
that in every router, at least one flit is prioritized above the others in every cycle.
In this way, at least one flit will certainly not be deflected. To ensure this when no
golden flits are present, MinBD adds a “silver” priority level, which wins arbitration
over common-case flits but loses to the golden flits. One silver flit is designated
randomly among the set of flits that enter a router at every cycle (this designation
is local to the router, and not propagated to other routers). This modification helps
to reduce deflection rate. Prioritizing a silver flit at every router does not impact
correctness, because it does not deflect a golden flit if one is present, but it ensures
that at least one flit will consistently win arbitration at both stages. Hence, deflection
rate is reduced, improving performance.
So far, we have considered the flow of flits from router input ports (i.e., arriving
from neighbor routers) to router output ports (i.e., to other neighbor routers). A flit
must enter and leave the network at some point. To allow traffic to enter (be injected)
and leave (be ejected), the MinBD router contains injection and ejection blocks in
its first pipeline stage (see Fig. 10.7). When a set of flits arrives on router inputs,
these flits first pass through the ejection logic. This logic examines the destination
of each flit, and if a flit is addressed to the local router, it is removed from the router
pipeline and sent to the local network node.5 If more than one locally-addressed flit
is present, the ejection block picks one, according to the same priority scheme used
by routing arbitration.
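The ejection choice can be sketched as follows; priority_key stands in for the Ruleset 2 ordering, and the flit attribute names are assumptions.

```python
def eject_one(flits, local_id, priority_key):
    """Remove at most one locally-addressed flit from the pipeline slots,
    chosen by the same priority order used in routing arbitration.
    Returns (ejected_flit_or_None, remaining_slots)."""
    local = [f for f in flits if f is not None and f.dest == local_id]
    if not local:
        return None, flits
    chosen = max(local, key=priority_key)
    return chosen, [None if f is chosen else f for f in flits]
```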
However, ejecting a single flit per cycle can produce a bottleneck and cause
unnecessary deflections for flits that could not be ejected. In the workloads we
evaluate, at least one flit is eligible to be ejected 42.8 % of the time. Of those cycles,
5 Note that flits are reassembled into packets after ejection. To implement this reassembly, we use
the Retransmit-Once scheme, as used by CHIPPER and described in Sect. 10.2.2.2.
20.4 % of the time, at least two flits are eligible to be ejected. Hence, in ∼8.5 % of all
cycles, a locally-addressed flit would be deflected rather than ejected if only one flit
could be ejected per cycle. To avoid this significant deflection-rate penalty, MinBD
doubles the ejection bandwidth. To implement this, a MinBD router contains two
ejection blocks. Each of these blocks is identical, and can eject up to one flit per
cycle. Duplicating the ejection logic allows two flits to leave the network per cycle
at every node.6
After locally-addressed flits are removed from the pipeline, new flits are allowed
to enter. There are two injection blocks in the router pipeline shown in Fig. 10.7:
(i) re-injection of flits from the side buffer, and (ii) injection of new flits from the
local node. (The “Redirection” block prior to the injection blocks will be discussed
in the next section.) Each block operates in the same way. A flit can be injected into
the router pipeline whenever one of the four inputs does not have a flit present in a
given cycle, i.e., whenever there is an “empty slot” in the network. Each injection
block pulls up to one flit per cycle from an injection queue (the side buffer, or the
local node’s injection queue), and places a new flit in the pipeline when a slot is
available. Flits from the side buffer are re-injected before new traffic is injected
into the network. However, note that there is no guarantee that a free slot will be
available for an injection in any given cycle. We now address this starvation problem
for side buffer re-injection.
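Before turning to that starvation problem, the two injection blocks just described can be sketched as below, assuming four pipeline slots and simple FIFO queues; the iteration order preserves the rule that the side buffer is drained before new local traffic.

```python
from collections import deque

def inject(slots, side_buffer: deque, local_queue: deque):
    """Each injection block places at most one flit per cycle into an
    empty slot; the side buffer gets the first chance."""
    for queue in (side_buffer, local_queue):
        if queue:
            for i, flit in enumerate(slots):
                if flit is None:              # an "empty slot" in the network
                    slots[i] = queue.popleft()
                    break                     # at most one flit per block per cycle
    return slots
```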
When a flit enters the side buffer, it leaves the router pipeline, and must later be
re-injected. As we described above, flit re-injection must wait for an empty slot on
an input link. It is possible that such a slot will not appear for a long time. In this
case, the flits in the side buffer are delayed unfairly while other flits make forward
progress.
To avoid this situation, MinBD implements buffer redirection. The key idea
of buffer redirection is that when this side buffer starvation problem is detected,
one flit from a randomly-chosen router input is forced to enter the side buffer.
Simultaneously, the flit at the head of the side buffer is injected into the slot created
by the forced flit buffering. In other words, one router input is “redirected” into
the FIFO buffer for one cycle, in order to allow the buffer to make forward progress.
This redirection is enabled for one cycle whenever the side buffer injection is starved
(i.e., has a flit to inject, but no free slot allows the injection) for more than some
6 For fairness, because dual ejection widens the datapath from the router to the local node (core or
cache), we also add dual ejection to the baseline bufferless deflection network and input-buffered
network when we evaluate performance, but not when we evaluate the power, area, or critical path
of these baselines.
threshold Cthreshold cycles (in our evaluations, Cthreshold = 2). Finally, note that if a
golden flit is present, it is never redirected to the buffer, because this would break
the delivery guarantee.
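Putting the starvation detection and redirection together, a per-router sketch might look as follows; the counter bookkeeping and flit attribute names are our assumptions, and C_THRESHOLD = 2 follows the evaluated configuration.

```python
import random
from collections import deque

C_THRESHOLD = 2  # starvation threshold in cycles, as evaluated in the chapter

def redirect_if_starved(slots, side_buffer: deque, starved_cycles: int) -> int:
    """If side-buffer re-injection has been starved for C_THRESHOLD cycles,
    force one randomly-chosen non-golden input flit into the buffer and
    re-inject the buffer head into the freed slot. Returns the updated
    starvation counter."""
    if not side_buffer or None in slots:
        return 0                          # nothing waiting, or re-injection succeeded
    if starved_cycles < C_THRESHOLD:
        return starved_cycles + 1
    candidates = [i for i, f in enumerate(slots) if not f.is_golden]
    if candidates:                        # never redirect a golden flit
        i = random.choice(candidates)
        side_buffer.append(slots[i])      # redirected flit enters the FIFO...
        slots[i] = side_buffer.popleft()  # ...as the old head takes its slot
    return 0
```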
MinBD provides livelock-free delivery of flits using Golden Packet and buffer
redirection. If no flit is ever buffered, then Golden Packet [17] ensures livelock
freedom (the “silver flit” priority never deflects any golden flit, hence does not break
the guarantee). Now, we argue that adding side buffers does not cause livelock.
First, the buffering logic never places a golden flit in the side buffer. However, a
flit could enter a buffer and then become golden while waiting. Redirection ensures
correctness in this case: it provides an upper bound on residence time in a buffer
(because the flit at the head of the buffer will leave after a certain threshold time
in the worst case). If a flit in a buffer becomes golden, it only needs to remain
golden long enough to leave the buffer in the worst case, then progress to its
destination. MinBD chooses the threshold parameter (Cthreshold ) and golden epoch
length so that this is always possible. More details can be found in our extended
technical report [18].
MinBD achieves deadlock-free operation by using Retransmit-Once [17], which
ensures that every node always consumes flits delivered to it by dropping flits when
no reassembly/request buffer is available. This avoids packet-reassembly deadlock
(as described in [17]), as well as protocol level deadlock, because message-class
dependencies [25] no longer exist.
and the request is always assumed to hit and return data. This potentially increases
network load relative to a real system, where off-chip memory bandwidth can also
be a bottleneck. However, note that this methodology is conservative: because
MinBD degrades performance relative to the buffered baseline, the performance
degradation that we report is an upper bound on what would occur when other
bottlenecks are considered. We choose to perform our evaluations this way in order
to study the true capacity of the evaluated networks (if network load is always low
because system bottlenecks such as memory latency are modeled, then the results
do not give many insights about router design). Note that the cache hierarchy details
(L1 and L2 access latencies, and MSHRs) are still realistic. We remove only the
off-chip memory latency/bandwidth bottleneck.
Baseline Routers: We compare MinBD to a conventional input-buffered virtual
channel router [12] with buffer bypassing [37, 49], a bufferless deflection router
(CHIPPER [17]), and a hybrid bufferless-buffered router (AFC [29]). In particular,
we sweep buffer size for input-buffered routers. We describe a router with m virtual
channels (VCs) per input and n flits of buffer capacity per VC as an (m, n)-buffered
router. We compare to (8, 8)-, (4, 4)-, and (4, 1)-buffered routers in our main results.
The (8, 8) point represents a very large (overprovisioned) baseline, while (4, 4)
is a more reasonable general-purpose configuration. The (4, 1) point represents
the minimum buffer size for deadlock-free operation (two message classes [25],
times two to avoid routing deadlock [10]). Furthermore, though 1-flit VC buffers
would reduce throughput because they do not cover the credit round-trip latency,
7 410.bwaves, 416.gamess and 434.zeusmp were excluded because we were not able to collect
representative traces from these applications.
10.6 Evaluation
In this section, we evaluate MinBD against a bufferless deflection router [17] and an
input-buffered router with buffer bypassing [37,49], as well as a hybrid of these two,
AFC [29], and demonstrate that by using a combination of deflection routing and
buffering, MinBD achieves performance competitive with the conventional input-
buffered router (and higher than the bufferless deflection router), with a smaller
buffering requirement, and better energy efficiency than all prior designs.
Figure 10.8 (top pane) shows application performance as weighted speedup for
4 × 4 (16-node) and 8 × 8 (64-node) CMP systems. The plots show average results
for each workload category, as described in Sect. 10.5, as well as overall average
results. Each bar group shows the performance of three input-buffered routers: 8
VCs with 8 flits/VC, 4 VCs with 4 flits/VC, and 4 VCs with 1 flit/VC. Next is
CHIPPER, the bufferless deflection router, followed by AFC [29], a coarse-grained
hybrid router that switches between a bufferless and a buffered mode. MinBD is
shown last in each bar group. We make several observations:
1. MinBD improves performance relative to the bufferless deflection router by
2.7 % (4.9 %) in the 4 × 4 (8 × 8) network over all workloads, and 8.1 % (15.2 %)
in the highest-intensity category. Its performance is within 2.7 % (3.9 %) of
the (4, 4) input-buffered router, which is a reasonably-provisioned baseline, and
within 3.1 % (4.2 %) of the (8, 8) input-buffered router, which has large, power-
hungry buffers. Hence, adding a side buffer allows a deflection router to obtain
significant performance improvement, and the router becomes more competitive
with a conventional buffered design.
2. Relative to the 4-VC, 1-flit/VC input-buffered router (third bar), which is the
smallest deadlock-free (i.e., correct) design, MinBD performs nearly the same
despite having less buffer space (4 flits in MinBD vs. 16 flits in (4, 1)-buffered).
Hence, buffering only a portion of traffic (i.e., flits that would have been
deflected) makes more efficient use of buffer space.
3. AFC, the hybrid bufferless/buffered router which switches modes at the router
granularity, performs essentially the same as the 4-VC, 4-flit/VC input-buffered
router, because it is able to use its input buffers when load increases. However, as
we will see, this performance comes at an efficiency cost relative to our hybrid
design.
Fig. 10.8 Performance (weighted speedup), network power, and energy efficiency (performance per watt) in 4 × 4 and 8 × 8 networks, binned by network intensity, comparing Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, AFC (4,4), and MinBD-4
Network Power: Figure 10.8 (middle pane) shows average total network power,
split by component and type (static/dynamic), for 4 × 4 and 8 × 8 networks across
the same workloads. Note that static power is shown in the bottom portion of each
bar, and dynamic power in the top portion. Each is split into buffer power, link
power, and other power (which is dominated by datapath components, e.g., the
crossbar and pipeline registers). We make several observations:
1. Buffer power is a large part of total network power in the input-buffered routers
that have reasonable buffer sizes, i.e., (4, 4) and (8, 8) (VCs, flits/VC), even
with empty-buffer bypassing, largely because static buffer power (bottom bar
segment) is significant. Removing large input buffers reduces static power in
MinBD as well as the purely-bufferless baseline, CHIPPER.8 Because of this
reduction, MinBD’s total network power never exceeds that of the input-buffered
baselines, except in the highest-load category in an 8 × 8-mesh (by 4.7 %).
2. Dynamic power is larger in the baseline deflection-based router, CHIPPER,
than in input-buffered routers: CHIPPER has 31.8 % (41.1 %) higher dynamic
power than the (4, 4)-buffered router in the 4 × 4 (8 × 8) networks in the highest-
load category. This is because bufferless deflection-based routing requires more
network hops, especially at high load. However, in a 4 × 4 network, MinBD
consumes less dynamic power (by 8.0 %) than the (4, 4)-buffered baseline in
the highest-load category because the reduced deflection rate (by 58 %) makes this
problem relatively less significant, allowing the savings from buffer dynamic energy
and a simplified datapath to dominate. In an 8 × 8 network, MinBD’s dynamic
power is only 3.2 % higher.
3. MinBD and CHIPPER, which use a permutation network-based datapath rather
than a full 5 × 5 crossbar, reduce datapath static power (which dominates
the “static other” category) by 31.0 %: the decoupled permutation network
arrangement has less area, in exchange for partial permutability (which causes
some deflections). Input-buffered routers and AFC require a full crossbar because
they cannot deflect flits when performing buffered routing (partial permutability
in a non-deflecting router would significantly complicate switch arbitration,
because each output arbiter’s choice would be limited by which other flits are
traversing the router).
4. AFC, the coarse-grained hybrid, has nearly the same network power as the (4, 4)
buffered router at high load: 0.6 % (5.7 %) less in 4 × 4 (8 × 8). This is because
its buffers are enabled most of the time. At low load, when it can power-gate its
buffers frequently, its network power reduces. However, AFC’s network power
8 Note that network power in the buffered designs takes buffer bypassing into account, which
reduces these baselines’ dynamic buffer power. The (4, 4)-buffered router bypasses 73.7 %
(83.4 %) of flits in 4 × 4 (8 × 8) networks. Without buffer bypassing, this router has 7.1 % (6.8 %)
higher network power, and 6.6 % (6.4 %) worse energy-efficiency.
Fig. 10.9 Power (X) vs. application performance (Y) in 4 × 4 networks. The line represents all
points with equivalent performance-per-watt to MinBD
is still higher than the pure bufferless router (CHIPPER) or MinBD because
(i) it still spends some time in its buffered mode, and (ii) its datapath power
is higher, as described above. On average, AFC still consumes 36.8 % (18.1 %)
more network power than CHIPPER, and 33.5 % (33.0 %) more than MinBD, in
the lowest-load category.
Energy efficiency: Figure 10.8 (bottom pane) shows energy efficiency. We make
two key observations:
1. MinBD has the highest energy efficiency of any evaluated design: on average in
4 × 4 (8 × 8) networks, 42.6 % (33.8 %) better than the reasonably-provisioned
(4, 4) input-buffered design. MinBD has 15.9 % (8.7 %) better energy-efficiency
than the most energy-efficient prior design, the (4, 1)-buffered router.
2. At the highest network load, MinBD becomes less energy-efficient compared to
at lower load, and its efficiency degrades at a higher rate than the input-buffered
routers with large buffers (because of deflections). However, its per-category
energy-efficiency is still better than all baseline designs, with two exceptions.
In the highest-load category (near saturation) in an 8 × 8-mesh, MinBD has
nearly the same efficiency as the (4, 1)-buffered router (but, note that MinBD is
much more efficient than this baseline router at lower loads). In the lowest-load
category in a 4 × 4 mesh, the purely-bufferless router CHIPPER is slightly more
energy-efficient (but, note that CHIPPER’s performance and efficiency degrade
quickly at high loads).
We conclude that, by achieving competitive performance with the buffered
baseline, and making more efficient use of a much smaller buffer capacity (hence
reducing buffer power and total network power), MinBD provides better energy
efficiency than prior designs. To summarize this result, we show a 2D plot of power
and application performance in Fig. 10.9 for 4 × 4 networks, and a wider range of
buffered router designs, as well as MinBD and CHIPPER. (Recall from Sect. 10.5
that several of the baseline input-buffered designs are not deadlock free (too few
VCs) or have a buffer depth that does not cover credit round-trip latency, but we
evaluate them anyway for completeness.) In this plot, with power on the X axis
and application performance on the Y axis, a line through the origin represents a
fixed performance-per-watt ratio (the slope of the line). This equal-efficiency line
is shown for MinBD. Points above the line have better efficiency than MinBD, and
points below have worse efficiency. As shown, MinBD presents the best energy
efficiency among all evaluated routers. The trend in an 8 × 8 network (not shown
for space) is similar (see technical report [18]).
To understand the observed performance gain in more detail, we now break down
performance by each component of MinBD. Figure 10.10 shows performance (for
4 × 4 networks) averaged across all workloads for eight deflection systems, which
constitute all possible combinations of MinBD’s mechanisms added to the baseline
(CHIPPER) router: dual-width ejection (D), silver-flit prioritization (S), and the
side buffer (B), shown with the same three input-buffered configurations as before.
The eighth bar (D + S + B), which represents all three mechanisms added to the
baseline deflection network, represents MinBD. Table 10.2 shows deflection rate
for the same set of systems.
We draw three main conclusions:
1. All mechanisms individually contribute to performance and reduce deflection
rate. Dual ejection (D) increases performance by 3.7 % over baseline CHIPPER.9
9 The main results presented in Fig. 10.8 use this data point (with dual ejection) in order to make a
fair (same external router interface) comparison.
Fig. 10.11 Synthetic traffic evaluations for MinBD, CHIPPER and input-buffered routers (with
small and large input buffers), in 4 × 4 and 8 × 8 meshes
Side Buffer Size: As side buffer size is varied from 1 to 64 flits, mean weighted
speedup (application performance) changes only 0.2 % on average across all
workloads (0.9 % in the highest-intensity category) in 4 × 4 networks. We conclude
that the presence of the buffer (to buffer at least one deflected flit) is more important
than its size, because the average utilization of the buffer is low. In a 4 × 4 MinBD
network with 64-flit side buffers at saturation (61 % injection rate, uniform random),
the side buffer is empty 48 % of the time on average; 73 % of the time, it contains
4 or fewer flits; 93 % of the time, 16 or fewer. These measurements suggest that
a very small side buffer captures most of the benefit. Furthermore, total network
power increases by 19 % (average across all 4 × 4 workloads) when a 1-flit buffer
per router is increased to a 64-flit buffer per router. Hence, a larger buffer wastes
power without significant performance benefit.
We avoid a 1-flit side buffer because of the way the router is pipelined: such
a single-flit buffer would either require a flit to be able to enter and then leave
the buffer in the same cycle (thus eliminating the independence of the two router
pipeline stages), or else could sustain a throughput of only one flit every two cycles.
(For this sensitivity study, we optimistically assumed the former option for the 1-flit
case.) The 4-flit buffer we use avoids this pipelining issue, while increasing network
power by only 4 % on average over the 1-flit buffer.
Note that although the size we choose for the side buffer happens to be the same
as the 4-flit packet size which we use in our evaluations, this need not be the case.
In fact, because the side buffer holds deflected flits (not packets) and deflection
decisions occur at a per-flit granularity, it is unlikely that the side buffer will hold
more than one or two flits of a given packet at a particular time. Hence, unlike
conventional input-buffered routers which typically size a buffer to hold a whole
packet, MinBD’s side buffer can remain small even if packet size increases.
Packet Size: Although we perform our evaluations using a 4-flit packet size, our
conclusions are robust to packet size. In order to demonstrate this, we also evaluate
MinBD, CHIPPER, and the (4, 4)- and (8, 8)-input-buffered routers in 4 × 4 and
8 × 8 networks using a data packet size of 8 flits. In a 4 × 4 (8 × 8) network, MinBD
improves performance over CHIPPER by 17.1 % (22.3 %), achieving performance
within 1.2 % (8.1 %) of the (4, 4)-input-buffered router and within 5.5 % (12.8 %)
of the (8, 8)-input-buffered router, while reducing average network power by 25.0 %
(18.1 %) relative to CHIPPER, 16.0 % (9.4 %) relative to the (4, 4)-input-buffered
router, and 40.3 % (34.5 %) relative to the (8, 8)-input-buffered router, respectively.
MinBD remains the most energy-efficient design as packet size increases.
Table 10.3 Normalized router area and critical path for bufferless and buffered baselines,
compared to MinBD

Router design                     CHIPPER   MinBD   Buffered (8, 8)   Buffered (4, 4)   Buffered (4, 1)
Normalized area                   1.00      1.03    2.06              1.69              1.60
Normalized critical path length   1.00      1.07    0.99              0.99              0.99
We present normalized router area and critical path length in Table 10.3. Both
metrics are normalized to the bufferless deflection router, CHIPPER, because it has
the smallest area of all routers. MinBD adds only 3 % area overhead with its small
buffer. In both CHIPPER and MinBD, the datapath dominates the area. In contrast,
the large-buffered baseline has 2.06× area, and the reasonably-provisioned buffered
baseline has 1.69× area. Even the smallest deadlock-free input-buffered baseline
has 60 % greater area than the bufferless design (55 % greater than MinBD). In
addition to reduced buffering, the reduction seen in CHIPPER and MinBD is partly
due to the simplified datapath in place of the 5 × 5 crossbar (as also discussed in
Sect. 10.6.2). Overall, MinBD reduces area relative to a conventional input-buffered
router both by significantly reducing the required buffer size, and by using a more
area-efficient datapath.
Table 10.3 also shows the normalized critical path length of each router
design, which could potentially determine the network operating frequency. MinBD
increases critical path by 7 % over the bufferless deflection router, which in turn
has a critical path 1 % longer than an input-buffered router. In all cases, the critical
path is through the flit arbitration logic (the permutation network in MinBD and
CHIPPER, or the arbiters in the input-buffered router). MinBD increases critical
path relative to CHIPPER by adding logic in the deflection-routing stage to pick a
flit to buffer, if any. The buffer re-injection and redirection logic in the first pipeline
stage (ejection/injection) does not impact the critical path because the permutation
network pipeline stage has a longer critical path.
10.7 Related Work

To our knowledge, MinBD is the first NoC router design that combines deflection
routing with a small side buffer that reduces deflection rate. Other routers combine
deflection routing with buffers, but do not achieve the efficiency of MinBD because
they either continue to use input buffers for all flits (Chaos router) or switch all
buffers on and off at a coarse granularity with a per-router mode switch (AFC), in
contrast to MinBD’s fine-grained decision to buffer or deflect each flit.
Buffered NoCs that also use deflection: Several routers that primarily operate
using buffers and flow control also use deflection routing as a secondary mechanism
under high load. The Chaos Router [32] deflects packets when a packet queue
becomes full to probabilistically avoid livelock. However, all packets that pass
through the router are buffered; in contrast, MinBD performs deflection routing
first, and only buffers some flits that would have been deflected. This key aspect of
our design reduces buffering requirements and buffer power. The Rotary Router [1]
allows flits to leave the router’s inner ring on a non-productive output port after
circulating the ring enough times, in order to ensure forward progress. In this case,
again, deflection is used as an escape mechanism to ensure probabilistic correctness,
rather than as the primary routing algorithm, and all packets must pass through the
router’s buffers.
Other bufferless designs: Several prior works propose bufferless router designs
[17, 21, 26, 38, 47]. We have already extensively compared to CHIPPER [17],
from which we borrow the deflection routing logic design. BLESS [38], another
bufferless deflection network, uses a more complex deflection routing algorithm.
Later works showed BLESS to be difficult to implement in hardware [17, 26, 37],
thus we do not compare to it in this work. Other bufferless networks drop rather
than deflect flits upon contention [21, 26]. Some earlier large multiprocessor
interconnects, such as those in HEP [42] and Connection Machine [27], also used
deflection routing. The HEP router combined some buffer space with deflection
routing (Smith, 2008, Personal communication). However, these routers’ details
are not well-known, and their operating conditions (large off-chip networks) are
significantly different than those of modern NoCs.
More recently, Fallin et al. [19] applied deflection routing to a hierarchical ring
topology, allowing most routers (those that lie within a ring) to be designed without
any buffering or flow control, and using only small buffers to transfer between rings.
The resulting design, HiRD, was shown to be more energy-efficient than the baseline
hierarchical ring with more buffering. HiRD uses many of the same general ideas
as MinBD to ensure forward progress, e.g., enforcing explicit forward-progress
guarantees in the worst case without impacting common-case complexity.
Improving high-load performance in bufferless networks: Some work has pro-
posed congestion control to improve performance at high network load in bufferless
deflection networks [8, 39, 40]. These works use source throttling: when network-
intensive applications cause high network load which degrades performance for
other applications, these intensive applications are prevented from injecting network
traffic some of the time. By reducing network load, source throttling reduces
deflection rate and improves overall performance and fairness. These congestion
control techniques and others (e.g., [46]) are orthogonal to MinBD, and could
improve MinBD’s performance further.
Hybrid buffered-bufferless NoCs: AFC [29] combines a bufferless deflection
router based on BLESS [38] with input buffers, and switches between bufferless
deflection routing and conventional input-buffered routing based on network load at
each router. While AFC has the performance of buffered routing in the highest-load
case, with better energy efficiency in the low-load case (by power-gating buffers
when not needed), it misses the opportunity to improve efficiency because it switches
buffers on at a coarse granularity. When an AFC router experiences high load, it
performs a mode switch which takes several cycles in order to turn on its buffers.
Then, it pays the buffering energy penalty for every flit, whether or not it would have
been deflected. It also requires buffers as large as the baseline input-buffered router
design in order to achieve equivalent high-load performance. As a result, its network
power is nearly as high as a conventional input-buffered router at high load, and it
requires fine-grained power gating to achieve lower power at reduced network load.
In addition, an AFC router has a larger area than a conventional buffered router,
because it must include both buffers and buffered-routing control logic as well as
deflection-routing control logic. In contrast, MinBD does not need to include large
buffers and the associated buffered-mode control logic, instead using only a smaller
side buffer. MinBD also removes the dependence on efficient buffer power-gating
that AFC requires for energy-efficient operation at low loads. We quantitatively
compared MinBD to AFC in Sect. 10.6 and demonstrated better energy efficiency at
all network loads.
Reducing cost of buffered routers: Empty buffer bypassing [37, 49] reduces
buffered router power by allowing flits to bypass input buffers when empty.
However, as our evaluations (which faithfully model the power reductions due to
buffer bypassing) show, this scheme reduces power less than our new router design:
bypassing is only effective when buffers are empty, which happens more rarely as
load increases. Furthermore, buffers continue to consume static power, even when
unused. Though both MinBD and empty-buffer-bypassed buffered routers avoid
buffering significant traffic, MinBD further reduces router power by using much
smaller buffers.
Kim [30] proposed a low-cost buffered router design in which a packet uses a
buffer only when turning, not when traveling straight along one dimension. Unlike
our design, this prior work does not make use of deflection, but uses deterministic
X-Y routing. Hence, it is not adaptive to different traffic patterns. Furthermore,
its performance depends significantly on the size of the turn-buffers. By using
deflection, MinBD is less dependent on buffer size to attain high performance, as
we argued in Sect. 10.6.5. In addition, [30] implements a token-based injection
starvation avoidance scheme which requires additional communication between
routers, whereas MinBD requires only per-router control to ensure side buffer
injection.
10.8 Conclusion
MinBD combines deflection routing with a small side buffer: when a flit would
otherwise be deflected, it is placed in the buffer instead. Previous router designs which use buffers typically
place these buffers at the router inputs. In such a design, energy is expended to read
and write the buffer for every flit, and buffers must be large enough to efficiently
handle all arriving traffic. In contrast to prior work, a MinBD router uses its buffer
for only a fraction of network traffic, and hence makes more efficient use of a given
buffer size than a conventional input-buffered router. Its average network power
is also greatly reduced: relative to an input-buffered router, buffer power is much
lower, because buffers are smaller. Relative to a bufferless deflection router, dynamic
power is lower, because deflection rate is reduced with the small buffer.
We evaluate MinBD against a comprehensive set of baseline router designs:
three configurations of an input-buffered virtual-channel router [12], a bufferless
deflection router, CHIPPER [17], and a hybrid buffered-bufferless router, AFC [29].
Our evaluations show that MinBD performs competitively and reduces network
power: on average in a 4 × 4 network, MinBD performs within 2.7 % of the input-
buffered design (a high-performance baseline) while consuming 31.8 % less total
network power on average relative to this input-buffered router (and 13.4 % less
than the bufferless router, which performs worse than MinBD). Finally, MinBD has
the best energy efficiency among all routers which we evaluated. We conclude that
a router design which augments bufferless deflection routing with a small buffer
to reduce deflection rate is a compelling design point for energy-efficient, high-
performance on-chip interconnect.
Acknowledgements We thank the anonymous reviewers of our conference papers CHIPPER [17]
and MinBD [20] for their feedback. We gratefully acknowledge members of the SAFARI group
and Michael Papamichael at CMU for feedback. Chris Fallin is currently supported by an
NSF Graduate Research Fellowship (Grant No. 0946825). Rachata Ausavarungnirun is currently
partially supported by the Royal Thai Government Scholarship. Onur Mutlu is partially supported
by the Intel Early Career Faculty Honor Program Award. Greg Nazario and Xiangyao Yu were
undergraduate research interns while this work was done. We acknowledge the generous support
of our industrial partners, including AMD, HP Labs, IBM, Intel, NVIDIA, Oracle, Qualcomm,
and Samsung. This research was partially supported by grants from NSF (CAREER Award CCF-
0953246, CCF-1147397 and CCF-1212962). This article is a significantly extended and revised
version of our previous conference papers that introduced CHIPPER [17] and MinBD [20].
References
1. P. Abad et al., Rotary router: an efficient architecture for CMP interconnection networks, in
ISCA-34, San Diego, 2007
2. J. Balfour, W.J. Dally, Design tradeoffs for tiled CMP on-chip networks, in ICS, Cairns, 2006
3. P. Baran, On distributed communications networks. IEEE Trans. Commun. Syst. CS-12, 1–9
(1964)
4. S. Borkar, Thousand core chips: a technology perspective, in DAC-44, San Diego, 2007
5. S. Borkar, Future of interconnect fabric: a contrarian view, in SLIP’10, Anaheim, 2010
6. S. Borkar, NoCs: What’s the point? in NSF Workshop on Emerging Technologies for Intercon-
nects (WETI), Washington, DC, Feb 2012
7. P. Bose, The power of communication: trends, challenges (and accounting issues), in NSF
WETI, Washington, DC, Feb 2012
8. K. Chang et al., HAT: heterogeneous adaptive throttling for on-chip networks, in SBAC-PAD,
New York, 2012
9. D.E. Culler et al., Parallel Computer Architecture: A Hardware/Software Approach (Morgan
Kaufmann, San Francisco, 1999)
10. W. Dally, C. Seitz, Deadlock-free message routing in multiprocessor interconnection networks.
IEEE Trans. Comput. 36, 547–553 (1987)
11. W.J. Dally, B. Towles, Route packets, not wires: on-chip interconnection networks, in DAC-38,
Las Vegas, 2001
12. W. Dally, B. Towles, Principles and Practices of Interconnection Networks (Morgan Kauf-
mann, Amsterdam/San Francisco, 2004)
13. R. Das, O. Mutlu, T. Moscibroda, C. Das, Application-aware prioritization mechanisms for
on-chip networks, in MICRO-42, New York, 2009
14. R. Das et al., Aérgia: exploiting packet latency slack in on-chip networks, in ISCA-37, Saint-
Malo, 2010
15. R. Das et al., Design and evaluation of hierarchical on-chip network topologies for next
generation CMPs, in HPCA-15, Raleigh, 2009
16. S. Eyerman, L. Eeckhout, System-level performance metrics for multiprogram workloads.
IEEE Micro 28, 42–53 (2008)
17. C. Fallin et al., CHIPPER: a low-complexity bufferless deflection router, in HPCA-17, San
Antonio, 2011
18. C. Fallin et al., MinBD: minimally-buffered deflection routing for energy-efficient intercon-
nect. SAFARI technical report TR-2011-008: http://safari.ece.cmu.edu/tr.html (2011)
19. C. Fallin et al., HiRD: a low-complexity, energy-efficient hierarchical ring interconnect.
SAFARI technical report TR-2012-004: http://safari.ece.cmu.edu/tr.html (2012)
20. C. Fallin et al., MinBD: minimally-buffered deflection routing for energy-efficient intercon-
nect, in NOCS-4, Copenhagen, 2012
21. C. Gómez et al., Reducing packet dropping in a bufferless NoC, in Euro-Par-14, Las Palmas
de Gran Canaria, 2008
22. B. Grot, J. Hestness, S. Keckler, O. Mutlu, Express cube topologies for on-chip interconnects,
in HPCA-15, Raleigh, 2009
23. B. Grot, S. Keckler, O. Mutlu, Preemptive virtual clock: a flexible, efficient, and cost-effective
QOS scheme for networks-on-chip, in MICRO-42, New York, 2009
24. B. Grot et al., Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and
service guarantees, in ISCA-38, San Jose, 2011
25. A. Hansson et al., Avoiding message-dependent deadlock in network-based systems-on-chip.
VLSI Des. 2007, 1–10 (2007)
26. M. Hayenga et al., SCARAB: a single cycle adaptive routing and bufferless network, in
MICRO-42, New York, 2009
27. W. Hillis, The Connection Machine (MIT, Cambridge, 1989)
28. Y. Hoskote et al., A 5-GHz mesh interconnect for a teraflops processor. IEEE Micro 27, 51–61
(2007)
29. S.A.R. Jafri et al., Adaptive flow control for robust performance and energy, in MICRO-43,
Atlanta, 2010
30. J. Kim, Low-cost router microarchitecture for on-chip networks, in MICRO-42, New York,
2009
31. J. Kim, J. Balfour, W. Dally, Flattened butterfly topology for on-chip networks, in MICRO-40,
Chicago, 2007
32. S. Konstantinidou, L. Snyder, Chaos router: architecture and performance, in ISCA-18,
Toronto, 1991
33. D. Kroft, Lockup-free instruction fetch/prefetch cache organization, in ISCA-8, Minneapolis,
1981
34. J. Laudon, D. Lenoski, The SGI Origin: a ccNUMA highly scalable server, in ISCA-24,
Boulder, 1997
35. J. Lee, M. Ng, K. Asanovic, Globally-synchronized frames for guaranteed quality-of-service
in on-chip networks, in ISCA-35, Beijing, 2008
36. C.K. Luk et al., Pin: building customized program analysis tools with dynamic instrumentation,
in PLDI, Chicago, 2005
37. G. Michelogiannakis et al., Evaluating bufferless flow-control for on-chip networks, in NOCS,
Grenoble, 2010
38. T. Moscibroda, O. Mutlu, A case for bufferless routing in on-chip networks, in ISCA-36,
Austin, 2009
39. G. Nychis et al., Next generation on-chip networks: what kind of congestion control do we
need? in HotNets-IX, Monterey, 2010
40. G. Nychis et al., On-chip networks from a networking perspective: congestion and scalability
in many-core interconnects, in SIGCOMM, Helsinki, 2012
41. H. Patil et al., Pinpointing representative portions of large Intel Itanium programs with dynamic
instrumentation, in MICRO-37, Portland, 2004
42. B. Smith, Architecture and applications of the HEP multiprocessor computer system, in SPIE,
San Diego, 1981
43. A. Snavely, D.M. Tullsen, Symbiotic jobscheduling for a simultaneous multithreaded proces-
sor, in ASPLOS-9, Cambridge, 2000
44. Standard Performance Evaluation Corporation: SPEC CPU2006 (2006), http://www.spec.org/
cpu2006
45. M. Taylor et al., The Raw microprocessor: a computational fabric for software circuits and
general-purpose programs. IEEE Micro 22, 25–35 (2002)
46. M. Thottethodi, A. Lebeck, S. Mukherjee, Self-tuned congestion control for multiprocessor
networks, in HPCA-7, Nuevo Leone, 2001
47. S. Tota et al., Implementation analysis of NoC: a MPSoC trace-driven approach, in GLSVLSI-
16, Philadelphia, 2006
11 A Thermal Aware Routing Algorithm for Application-Specific NoC

11.1 Introduction
11.1.1 Background
Network-on-chip (NoC) has been proposed to tackle this distinctive challenge [2].
NoCs utilize interconnected routers instead of buses or point-to-point wires to send
and receive packets between processing elements (PEs), which overcomes the
scalability limitations of buses and brings significant performance improvements
in terms of communication latency, power consumption, reliability, etc.
For the design of NoC-based high-performance MPSoCs, power and temperature
have become the dominant constraints [27]. Higher power consumption will lead to
higher temperature, and at the same time, an uneven power consumption distribution
will create thermal hotspots. These thermal hotspots have adverse effects on the
carrier mobility, the mean time between failures (MTBF), and the leakage current
of the chip. As a result, they will degrade performance and reliability dramatically.
Consequently, it is highly desirable to have an even power and thermal profile across
the chip [26]. This imposes a design constraint on the NoC: an uneven power
consumption profile should be avoided so as to reduce the hotspot temperature.
In a typical NoC-based MPSoC design flow, we need to allocate and schedule the
tasks on the available processor cores and map these processors to the NoC platform
first. After the task and processor mapping, a routing algorithm is developed to decide
the physical paths (i.e., the routers to be traversed) for sending the packets from the
sources to the destinations. Each phase of the NoC design affects the total power
consumption as well as the power profile across the chip [25]. Previously, several
task mapping and processor core floorplan algorithms [15] have been proposed to
achieve a thermal balanced NoC design. However, the routing algorithm is rarely
exploited for this purpose. Since the communication network (including the routers
and the physical links) consumes a significant portion of the chip’s total power
budget (such as 39 % of total tile power in [29]), the decision on the routing path
of the packets will greatly affect the power consumption distribution and hence
the overall hotspot temperature across the chip [27]. Therefore, it is important to
consider the thermal constraint in the routing phase of the NoC design.
In this chapter, we exploit the adaptive routing algorithm to achieve an even
temperature distribution for application-specific MPSoCs. Given an application
described by a task flow graph and a target NoC topology, we assume that the tasks
are already scheduled and allocated to the processors and the processor mapping is
also done. We then utilize the traffic information of the application specified in the
task flow graph, which can be obtained through profiling [4, 14], to decide how to
split the traffic among different physical routes for a balanced power distribution.
Figure 11.1 shows an example of the task flow graph and the corresponding mesh
topology mapping for a video object plane decoder (VOPD) application [4]. As
shown in Fig. 11.1, each tile is made up of a processor element (PE) which executes
certain operations and a router which is connected to its neighbors as well as a local
PE. The routing algorithm is designed to determine the routers to be traversed for
Fig. 11.1 Task graph and tile mapping for the Video Object Plane Decoder (VOPD) [4] applica-
tion on 4 × 3 NoC
each communication. Note that when we decide the routing for each traffic pair, we
have to make sure all the routing paths are deadlock free [16].
Similar to [9], we use the peak power (i.e., the peak energy under a given time
window) metric to evaluate the effectiveness in reducing the hotspot temperature.
Through simulation-based evaluation, we demonstrate that the proposed algorithm
can reduce the peak energy of the tiles by 10–20 % while improving or maintaining
the NoC performance in terms of throughput and latency.
11.1.2 Related Work

In the area of temperature-aware NoC design, many previous works focus on the
power consumption distribution of the processor cores. In [18], a dynamic task
migration algorithm was proposed to reduce the hotspot temperature due to the
processor core (i.e., PE). In [1], a thermal management hardware infrastructure
was implemented to adjust the frequency and voltage of the processing elements
according to the temperature requirements at run time.
Since the power consumption of the routers is as significant as that of the processor
cores, the thermal constraint should also be addressed in the routing algorithm
design. There have been many works on NoC routing algorithms for various purposes
including low power routing [19], fault-tolerant routing [11] and congestion avoid-
ance routing to improve latency [17]. However, there are only a few works [7, 9, 27]
taking temperature issue into account.
In [9], an ant-colony-based dynamic routing algorithm was proposed to reduce
the peak power. Heavy packet traffic is distributed across routes by this dynamic
algorithm to minimize the occurrence of hotspots. However, this dynamic routing
algorithm is generic in nature and does not take into account the application-specific
traffic information. Therefore it may not be capable of achieving an optimal
path distribution. Special control packets are sent among the routers to implement
the algorithm which increases the power overhead. Also, two additional forward
and backward ant units are needed in the router which results in a large area
overhead. Moreover, this work only minimizes the peak power of the routers but
does not consider the effect of the processor core power on the temperature. In [27],
a run-time thermal-correlation based routing algorithm is developed. When the peak
temperature of the chip exceeds a predefined threshold, the NoC is under thermal
emergency and the dynamic algorithm will throttle the load or re-route the packets
using the paths with the least thermal correlation to the run-time hottest regions. The
algorithm also does not consider the specific traffic information of the applications.
It may be inefficient if multiple hotspots occur at the same time. Moreover, it is
not very clear how to do the re-routing while still guaranteeing the deadlock free
property. In [7], a new routing-based traffic migration algorithm VDLAPR and the
buffer allocation scheme are proposed to trade-off between the load balanced and the
temperature balanced routing for 3D NoCs. In particular, the VDLAPR algorithm is
designed for 3D NoCs by distributing the traffic among various layers. For routing
within each layer, a thermal-aware routing algorithm such as the one introduced in
this chapter is still needed.
In this chapter, we propose a thermal-aware routing algorithm for application-specific
NoCs. To guarantee the deadlock-free property, generic routing schemes use
turn-model-based algorithms such as X-Y routing, odd-even routing or forbidden-turn routing [16].
However, this will limit the flexibility of re-distributing the traffic to achieve an even
power consumption profile. Here, we further utilize the characterized application
traffic information to achieve a larger path set for routing and at the same time
provide deadlock avoidance. Higher adaptivity and hence better performance can
be achieved since more paths can be used for the re-distribution of the traffic. Given
the set of possible routing paths, we formulate the problem of allocating the optimal
traffic among all paths as a mathematical programming problem. At run time, the
routing decisions will be made distributively according to the calculated traffic
splitting ratios. We demonstrate the effectiveness of the proposed routing strategy
on peak energy reduction using both synthetic and real application traffic.
We aim to achieve an even power consumption distribution and reduce the hotspot
temperature for application-specific MPSoCs using an NoC interconnect. We assume
that the given application is specified by a task flow graph which characterizes the
Fig. 11.2 An example of (a) Core communication graph (CCG) (b) NoC architecture (c) Channel
dependency graph (CDG) and its strongly connected components (SCC) after mapping (a) onto (b)
The number of minimum-length paths connecting the two tiles is given by
$N_{ij} = C_{d(i,j)}^{d_x(i,j)}$,¹ since we need to traverse along the x direction
$d_x(i,j)$ times among a total of $d(i,j)$ hops. Let $l(i,j,k)$ ($k \le N_{ij}$) denote
the kth path connecting the two tiles. If
l(i, j, k) traverses two network links (m, n) and (n, p) consecutively, then an edge
is added to connect the two nodes $L_{mn}$ and $L_{np}$ in the CDG. By inspecting all the
feasible paths, we can construct the whole channel dependency graph. In Fig. 11.2,
taking the communication pair (2, 4) as an example, two minimum-length paths
exist: (2 → 5 → 4) and (2 → 1 → 4). In the CDG, we add two edges connecting
(L2_5, L5_4) and (L2_1, L1_4), respectively. As shown in Fig. 11.2c, there are in total
10 strongly connected components in the resulting CDG, which can be found by
Tarjan’s algorithm [28]. We can find two cycles in Fig. 11.2c, i.e., L3_4 → L4_7 →
L7_6 → L6_3 → L3_4 and L4_3 → L3_6 → L6_7 → L7_4 → L4_3, where some edges
are required to be removed by the path-set finding algorithm in Sect. 11.4.1 to avoid
deadlocks.
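This construction is mechanical enough to sketch in a few lines of Python. Tiles are (x, y) coordinates, a CDG node is a directed link, and the coordinate scheme for Fig. 11.2’s tiles is an assumption for illustration.

```python
from itertools import permutations

def minimal_paths(src, dst):
    """Enumerate all minimal-length paths (lists of tiles) between two
    mesh tiles; there are N_ij = C(d, d_x) of them."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    moves = [(1 if dx > 0 else -1, 0)] * abs(dx) + \
            [(0, 1 if dy > 0 else -1)] * abs(dy)
    for order in set(permutations(moves)):
        path, cur = [src], src
        for step in order:
            cur = (cur[0] + step[0], cur[1] + step[1])
            path.append(cur)
        yield path

def build_cdg(pairs):
    """A CDG edge connects two links that some path traverses consecutively."""
    cdg = set()
    for src, dst in pairs:
        for path in minimal_paths(src, dst):
            links = list(zip(path, path[1:]))
            cdg.update(zip(links, links[1:]))
    return cdg

# Pair (2, 4) of Fig. 11.2 on a 3x3 mesh, with tile n at (n % 3, n // 3):
# the two minimal paths 2->5->4 and 2->1->4 yield the CDG-edge analogues
# of (L2_5, L5_4) and (L2_1, L1_4).
print(build_cdg([((2, 0), (1, 1))]))
```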
Definition 5. Routing adaptivity: Routing adaptivity is defined as the ratio of the
total number of available paths provided by a routing algorithm to the number of all
possible minimum length paths between the source and destination pairs. Higher
adaptivity will improve the capability of avoiding congestion and redistributing
the traffic to reduce thermal hotspot. However, we need to meet the deadlock-free
constraints while improving the routing adaptivity.
For a given application, we first use the task graph TG (Definition 1) to characterize
its traffic patterns as well as the communication volume between the tasks. Then,
based on the mapping algorithm on the target NoC platform, the task graph TG
is transformed to the core communication graph CCG (Definition 2). The channel
1 $C_n^r$ is the combination notation, where $C_n^r = \frac{n!}{r!(n-r)!}$
Fig. 11.3 A motivation example of allocating traffic among paths (a) the core communication
graph (CCG) (b) routing paths allocation on 2 × 3 NoC (c) two strategies of using the routing paths
(d) thermal profile comparison (left: strategy 1, right: strategy 2)
dependency graph CDG (Definition 3) is built based on the routing paths assigned
to each edge in CCG. In this chapter, we propose to find the strongly connected
components SCC (Definition 4) of the underlying CDG to remove the circular de-
pendency among the channel resources to avoid potential deadlock. Finally, routing
adaptivity (Definition 5) is the metric that reflects the capability of redistributing
traffic for different cycle breaking algorithms in SCC.
For adaptive routing, usually there will be more than one path available for every
communication pair. If traffic is distributed equally on all the paths, some of the
routers may have more paths passing through them and hence more packets to
receive and send. We show the need to allocate traffic properly among the paths
by an example illustrated in Fig. 11.3.
In Fig. 11.3, three communication pairs are assumed to occur concurrently: from
tile P3 to P1, tile P1 to P5 and tile P4 to P0. We consider the case of 1,000 packets
that are generated for each communication pair and sent over the network in this
example. The energy consumption of processing a single packet in a router is
denoted as E1 . We compare two routing strategies in the figure. Strategy 1 uses
uniform traffic distribution among all paths between the source and the destination
nodes. Strategy 2 allocates traffic non-uniformly among the candidate paths. The
total number of packets handled by each router is different in these two schemes.
The router energy distribution is shown in Fig. 11.3c. By properly allocating the
traffic, we can reduce the peak energy of the tile by 16 % and the energy difference
among the tiles by 37.5 %. Moreover, we can further assume the average energy
consumption of the processors and the routers is about the same and then use
HotSpot [13] to simulate and evaluate the thermal profile across the chip. As shown
in Fig. 11.3d, strategy 2 indeed makes the thermal profile more uniform and reduces
the hotspot temperature.
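The bookkeeping behind such a comparison is easy to reproduce: a router’s energy is E1 times the packets it handles, summed over every path crossing it. The paths and packet splits below are illustrative, not the exact allocations of Fig. 11.3.

```python
from collections import Counter

E1 = 1.0   # energy per packet per router (arbitrary unit)

def tile_energy(allocations):
    """allocations: list of (path, packets) with path = tuple of routers."""
    energy = Counter()
    for path, packets in allocations:
        for router in path:
            energy[router] += E1 * packets
    return energy

# Strategy 1: split 1,000 packets of pair P3 -> P1 evenly over two paths.
s1 = tile_energy([(("P3", "P4", "P1"), 500), (("P3", "P0", "P1"), 500)])
# Strategy 2: shift load off router P4 (assumed also used by other pairs).
s2 = tile_energy([(("P3", "P4", "P1"), 200), (("P3", "P0", "P1"), 800)])
print(s1["P4"], s2["P4"])   # 500 vs. 200 packets' worth of energy at P4
```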
From the above example, we can see that we need to find a set of paths for routing
the packets for every communication pair and allocate the traffic properly among
these paths so as to achieve a uniform power consumption profile. One critical
issue of determining the path set is to provide deadlock avoidance. In generic
routing algorithms, deadlock can be prevented by disallowing various turns [16].
For application-specific NoCs, this is too conservative and unnecessarily prohibits
some legitimate paths from being used [22], since some disallowed turns can actually
be used safely: the application has no traffic interacting with these turns to form
circular dependencies in the CDG. By using the information specified by the task
flow graph and its derived communication graph, we can find more paths available
for routing which increases the flexibility of distributing traffic among the sources
and the destinations. In this chapter, we use a similar approach as [22] to find the set
of admissible paths for each communication pair while still satisfying the deadlock
free requirement.
After obtaining the admissible path set for routing, we then find the optimal
traffic allocation to each path based on the bandwidth requirement for the purpose
of achieving a uniform power profile. The problem is formulated as a mathematical
optimization problem and solved by an LP solver. These design phases are done
offline at design time. After that, the optimal distribution ratio of each path
is obtained. For each particular source-destination communication pair, the ratios
of the paths are converted into the probabilities of using each port to route to the
destination for each router. At run time, these probability values are stored into the
routing tables in the routers. For each incoming packet, the router will query its
routing table and return the candidate output ports for this packet according to the
input direction and the destination. The final output port will be chosen according
to the probability value of each candidate.
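A minimal end-to-end sketch of this flow, under our own simplifying assumptions (peak tile energy as the objective, unit E1, and a toy pair set; the chapter’s precise formulation appears in Sect. 11.4): the offline LP computes the splitting ratios, and the run-time step draws a route from them. It requires SciPy.

```python
import random
from scipy.optimize import linprog

pairs = {   # pair -> (volume in packets, candidate paths as router tuples)
    "c1": (1000, [("P3", "P4", "P1"), ("P3", "P0", "P1")]),
    "c2": (1000, [("P1", "P4", "P5"), ("P1", "P2", "P5")]),
}
paths = [(c, p) for c, (_, ps) in pairs.items() for p in ps]
routers = sorted({r for _, p in paths for r in p})
n = len(paths)

# Variables: one splitting ratio per path, plus t = peak router energy.
c_vec = [0.0] * n + [1.0]                            # minimize t
A_ub = [[(pairs[c][0] if r in p else 0.0) for c, p in paths] + [-1.0]
        for r in routers]                            # energy(r) - t <= 0
b_ub = [0.0] * len(routers)
A_eq = [[(1.0 if c == cc else 0.0) for cc, _ in paths] + [0.0]
        for c in pairs]                              # each pair's ratios sum to 1
b_eq = [1.0] * len(pairs)
res = linprog(c_vec, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * n + [(0, None)])
ratios = dict(zip(paths, res.x[:n]))

# Run time: draw according to the stored ratios (here a whole path per pair;
# the chapter stores per-router, per-port probabilities instead).
def pick_path(pair):
    cand = [(p, ratios[(pair, p)]) for p in pairs[pair][1]]
    ps, ws = zip(*cand)
    return random.choices(ps, weights=ws, k=1)[0]

print(pick_path("c1"))
```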
Figure 11.4 summarizes the whole design flow of our proposed methodology. In
the next section, the details of the algorithm will be presented.
Fig. 11.4 The overall design flow of the proposed methodology: the application task flow graph and an example core communication graph are used to build the channel dependency graph (CDG)
In this section, we present the details of the offline routing algorithms. We first discuss the algorithm for finding the set of admissible paths for each source-destination communication pair; the admissible paths avoid circular dependencies among paths and hence provide the deadlock-free property. Then we present the optimal traffic allocation problem formulation.
In [22], a dynamic routing algorithm is proposed that increases the average routing adaptivity while maintaining the deadlock-free property.
[Fig. 11.5 shows, for a small mesh example, the enumerated minimal paths of each communication pair (e.g., P0→P1, P1→P0→P3, P3→P4→P1, P2→P5→P4→P3) and the admissible path set retained for routing after cycle removal.]
Fig. 11.5 Application-specific and deadlock-free path-finding algorithm
The average routing adaptivity is often used to represent the degree of adaptiveness and flexibility of a routing algorithm. Here we use a similar approach. However, unlike [22], which focuses on maximizing the average routing adaptivity to improve latency, we aim to maximize the flexibility to re-divert the traffic so as to even out the power consumption distribution. Therefore we also need to consider the bandwidth requirement of each communication, since the number of packets processed in each flow contributes differently to the overall energy distribution.
Figure 11.5 shows the main flow of our path-finding algorithm. Similar to [22], based on the application's task flow graph, we examine all the paths between the source and destination pairs to build the application channel dependency graph (ACDG). Most likely, there will be cycles in the ACDG, so some edges have to be removed to break these cycles and guarantee deadlock freedom. In [22], a branch-and-bound algorithm is used to select the set of edges to remove so that all the cycles are broken while connectivity is maintained and the average adaptivity is maximized. The average adaptivity α is defined as:
$$\max \alpha = \max \frac{1}{|C|}\sum_{c\in C}\alpha_c, \qquad \alpha_c = \frac{|\Phi_{S_{edge}}(c)|}{|\Phi(c)|} \qquad (11.1)$$

where

$$\Phi_{S_{edge}}(c) = \Phi(c)\setminus\{\,p \mid p\in\Phi(c),\; p\in P_{\mu\nu}\;\forall R_{\mu\nu}\in S_{edge}\,\} \qquad (11.2)$$
The notations used in the above equations are summarized in Table 11.1.
Connectivity is guaranteed for every communication c by making sure that at least one path exists between the source and destination nodes, i.e., $|\Phi_{S_{edge}}(c)| \geq 1,\ \forall c \in C$. In this chapter, instead of finding all cycles in the ACDG and breaking each cycle separately, as done in [22], we apply Tarjan's algorithm [28] to find all the strongly connected components (SCCs) and eliminate the cycles within each nontrivial component (i.e., each component with more than one vertex). One important property of SCCs is that every cycle of a directed graph is contained within a single component. The cycles can therefore be eliminated within each nontrivial component to achieve deadlock freedom. Tarjan's algorithm is used due to its low complexity (O(|V| + |E|)). In addition, in many cases several edges are shared among different cycles (as illustrated in Fig. 11.4, the two edges (L3−7, L7−6) and (L7−3, L3−2) are shared among several cycles). If we inspected each cycle separately, we might consider these edges more than once. When we employ Tarjan's algorithm, on the other hand, cycles with common edges fall into the same component, so the decision can be made more efficiently by removing shared edges that break several cycles simultaneously. When we choose edges to break the cycles, instead
of optimizing the average routing adaptivity, we maximize the following objective function:

$$\max \frac{1}{|C|}\sum_{c\in C}\alpha_c W_c = \max \frac{1}{|C|}\sum_{c\in C} W(c)\times\frac{|\Phi_{S_{edge}}(c)|}{|\Phi(c)|} \qquad (11.3)$$
Here, in Eq. 11.3, we weight the routing adaptivity of each communication by its corresponding bandwidth requirement (W(c) is the bandwidth of communication c). The rationale is that for communications with higher bandwidth requirements, more packets usually need to be processed and routed, so their impact on the power consumption distribution is higher. We therefore want higher flexibility to re-divert the traffic of these communications, and hence higher adaptivity.
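As a concrete illustration of this cycle-removal step, the following Python sketch finds the nontrivial SCCs of an ACDG and greedily removes, from each remaining cycle, the edge whose removal costs the least weighted adaptivity. It uses networkx for the SCC computation; the graph construction, the edge-cost function, and the greedy choice (used here in place of a branch-and-bound search) are simplifying assumptions, and the connectivity requirement of Eq. 11.2 is omitted for brevity.

```python
import networkx as nx

def break_cycles(acdg, edge_weight):
    """Remove edges inside nontrivial SCCs until the ACDG is acyclic.

    edge_weight(u, v) should estimate the weighted-adaptivity loss
    (in the sense of Eq. 11.3) of removing dependency edge (u, v);
    the cheapest edge of each remaining cycle is removed greedily.
    The connectivity check |Phi_Sedge(c)| >= 1 is omitted here.
    """
    removed = []
    for scc in nx.strongly_connected_components(acdg):
        if len(scc) < 2:                    # trivial SCC: contains no cycle
            continue
        sub = acdg.subgraph(scc).copy()     # every cycle lies inside one SCC
        while True:
            try:
                cycle = nx.find_cycle(sub)  # any remaining cycle in the SCC
            except nx.NetworkXNoCycle:
                break                       # this component is now acyclic
            u, v = min(cycle, key=lambda e: edge_weight(e[0], e[1]))[:2]
            sub.remove_edge(u, v)
            removed.append((u, v))
    acdg.remove_edges_from(removed)         # apply the removals to the ACDG
    return removed
```

Because all cycles sharing an edge live in the same component, one removal here can break several cycles at once, which is exactly the efficiency argument made above.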
In the following, we use the notations summarized in Table 11.2 and the energy
model described in Sect. 11.4.2.1 to obtain the linear programming (LP) formulation
of the optimal traffic allocation problem. In the LP engine, the inputs are the
application task flow graphs and the admissible path set. The outputs are the path
ratios to be used in the routing at run time.
We assume the energy consumption of each processor i (E_{p−i}) is available after task mapping. Wormhole routing is used in our routing scheme. In wormhole routing, each packet is divided into several flits, which are the minimum units of data transmission and flow control. For every data packet, the head flit sets up the path directions for the body and tail flits. Thus, the E_{rc}, E_{sel} and E_{vc} terms in Table 11.2 are only incurred when the head flit is being processed by the router. The total energy consumption for processing a single packet in router i can be represented as:

$$E_{r-i} = (E_{buffer\text{-}rw} + E_{forward} + E_{sw})\times S_{packet} + E_{rc} + E_{sel} + E_{vc} \qquad (11.4)$$

Let n_i denote the number of packets received by router i; the total router energy consumption is then E_{R−i} = E_{r−i} × n_i, and the total energy of each tile i (E_i) is equal to E_{p−i} + E_{R−i}.
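To make the cost model concrete, here is a small Python sketch of Eq. 11.4 and the resulting tile energy. All constant values are made-up placeholders; only the structure follows the text: the per-flit terms scale with packet size, while route computation, selection, and VC allocation are paid once per packet by the head flit.

```python
E_BUFFER_RW = 1.0e-12   # hypothetical per-flit buffer read/write energy (J)
E_FORWARD   = 0.8e-12   # hypothetical per-flit link-traversal energy (J)
E_SW        = 0.6e-12   # hypothetical per-flit switch energy (J)
E_RC, E_SEL, E_VC = 0.3e-12, 0.1e-12, 0.2e-12  # per-packet head-flit costs (J)

def packet_energy(s_packet_flits: int) -> float:
    """E_{r-i} of Eq. 11.4 for one packet of s_packet_flits flits."""
    return (E_BUFFER_RW + E_FORWARD + E_SW) * s_packet_flits \
        + E_RC + E_SEL + E_VC

def tile_energy(e_processor: float, n_packets: int, s_packet_flits: int) -> float:
    """E_i = E_{p-i} + E_{r-i} * n_i: processor plus router energy of tile i."""
    return e_processor + packet_energy(s_packet_flits) * n_packets
```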
Given the set of admissible paths for every communication pair, which is deadlock free, we want to obtain the ratio of traffic allocated to each path so as to minimize the maximum energy consumption among all the tiles. We formulate the following linear programming problem to obtain the optimal path ratios:

1. Variables r(i, j, k): the portion of traffic allocated to the kth path l(i, j, k) between tiles i and j among all the L_{ij} paths, where 1 ≤ i ≤ N_tile, 1 ≤ j ≤ N_tile, 1 ≤ k ≤ L_{ij}.

2. Objective function: The energy consumption of the ith tile is E_i = E_{p−i} + E_{r−i} × n_i, where n_i is the total number of packets received by router i per unit time, equal to the summation of the packets received from all paths passing through tile i (i.e., the paths in T_i). More specifically, n_i = Σ_{l(a,b,k)∈T_i} r(a, b, k) × p(a, b). In order to balance the tile energies E_i, our objective function is written in min-max form, min max_{1≤i≤N_tile} E_i, which is equivalent to minimizing an auxiliary bound S subject to E_i ≤ S for all i.

3-2. Bandwidth constraints: the aggregate bandwidth used on a specific link should not exceed the link capacity imposed by the physical NoC platform. The communication bandwidth = packet injection rate × packet size × clock frequency. Assume T is the cycle time and (i, j) is a physical link in the mesh NoC; a path l(a, b, k) traverses this link if and only if l(a, b, k) ∈ T_i ∩ T_j. So the aggregate bandwidth of all such paths is constrained not to exceed the capacity of link (i, j).
Taking the application's task flow graph and the task mapping as input, the packet injection rate p(a, b) can be calculated by summing the bandwidth volumes of all the communication pairs whose source tasks are mapped onto tile a and whose destination tasks are mapped onto tile b.
The above formulation is a typical linear programming problem and can be solved efficiently using the MATLAB CVX optimization toolbox [8].
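The sketch below sets up the same min-max LP in cvxpy, a Python analogue of the CVX toolbox cited above. The toy instance (one flow, two admissible paths, two tiles of interest) and all numeric constants are hypothetical stand-ins for the real path sets and energy parameters.

```python
import cvxpy as cp
import numpy as np

n_paths = 2
r = cp.Variable(n_paths, nonneg=True)   # r(i, j, k): per-path traffic ratios
S = cp.Variable()                       # auxiliary bound on every tile energy

p_ab = 0.02                             # packets/cycle injected by the flow
e_router = 1.0e-12                      # energy per packet per router (J)
e_proc = np.array([2.0e-12, 2.5e-12])   # per-tile processor energy E_{p-i} (J)
# membership[t, k] = 1 if path k traverses tile t (i.e., l(a,b,k) is in T_t)
membership = np.array([[1, 0],
                       [0, 1]])

n_t = membership @ r * p_ab             # packets handled per tile, n_i
E = e_proc + e_router * n_t             # tile energies E_i
constraints = [cp.sum(r) == 1,          # ratios of one flow sum to 1
               E <= S]                  # min-max epigraph constraints
prob = cp.Problem(cp.Minimize(S), constraints)
prob.solve()
print(r.value, S.value)
```

A real instance would add one ratio-sum constraint per flow and the per-link bandwidth constraints described above; both remain linear, so the problem stays an LP.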
After solving the LP problem, we obtain a set of admissible paths and their corresponding traffic allocation ratios. To use this information in the implementation of the NoC routing, we can use two schemes. The simplest is source routing: the header flit of each packet carries the entire path information, and the source processor i decides which path to use to send a packet to destination j according to the traffic allocation ratios r(i, j, k) of the set of admissible paths. The Intel Teraflops chip [16] uses this source-based routing scheme because it keeps the router simple. However, it creates a large overhead on the effective bandwidth, as the packet needs to carry additional payload to record the entire routing path.
A more efficient way of implementing the thermal-aware routing is to use routing tables stored in each router. One major advantage of table-based routing is that it can be dynamically reconfigured or reloaded [22] to accommodate changes in the communication requirements. However, since the routing decision is made within each router without knowing the entire path, the path traffic-allocation ratios obtained previously cannot be used directly. In the following, we show how to convert these ratios into local probabilities with which each router selects the output port for a packet.
To support the thermal-aware routing, the routing table is organized as follows: for each router t in the mesh topology and each of its input directions d ∈ D_in(t), there is a routing table RT(t, d) [22]. For each output port o, there is a set of corresponding entries in the routing table. Each entry is made up of the tile id numbers of the source (s) and destination (b) pair, as well as the probability value p(o) of using o to route to b. Formally, RT(t, d) = {(s, b, o, p(o)) | o ∈ D_out(t), 0 ≤ p(o) ≤ 1}.
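The following Python sketch shows one possible in-memory layout of RT(t, d) and the probabilistic port lookup it supports; the dictionary structure, the example entry, and the fallback logic are illustrative assumptions rather than the authors' implementation.

```python
import random

# RT[(t, d)] maps (source, destination) -> list of (output_port, probability);
# the probabilities over the candidate ports of one entry sum to 1.
RT = {
    ((1, 1), 'W'): {
        (0, 5): [('E', 0.7), ('S', 0.3)],   # hypothetical example entry
    },
}

def select_output_port(t, d, src, dst, rng=random.random):
    """Return one candidate port for a packet, chosen by the stored ratios."""
    candidates = RT[(t, d)][(src, dst)]
    tau = rng()                        # stand-in for the hardware LFSR value
    acc = 0.0
    for port, p in candidates:
        acc += p
        if tau <= acc:                 # cumulative comparison against tau
            return port
    return candidates[-1][0]           # numeric safety fallback
```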
When routing tables are used, the final routing path of a packet of a specific traffic flow is composed distributively from the output ports selected at each router along the path. One problem may arise here: could the selected output ports form a new path that is not included in the admissible path set and hence reintroduce the possibility of deadlock? In the following, we prove that any path generated at run time by the table-based routing is indeed deadlock free, provided the path set used for generating the routing tables RT(t, d) contains no cycles.

Lemma 1 (deadlock-free property of distributive routing). For the table-based routing, after generating the distributed routing table of each router according to the deadlock-free path set P, the actual route generated at run time will not follow a disallowed path outside P. The deadlock-free property is hence inherited from the path set P.
represented as p_{t,d}(o_1, s, b) and p_{t,d}(o_2, s, b). They are calculated by comparing the aggregate traffic of the paths in T_{t,d} that use ports o_1 and o_2, respectively, to route to tile b. Following the notation of Table 11.1, for paths l(s, b, k) ∈ T_{t,d}(o_1, s, b) and l(s, b, l) ∈ T_{t,d}(o_2, s, b), we have:

$$p_{t,d}(o_1,s,b) = \frac{\sum_{l(s,b,k)\in T_{t,d}(o_1,s,b)} r(s,b,k)}{\sum_{l(s,b,k)\in T_{t,d}(o_1,s,b)} r(s,b,k)+\sum_{l(s,b,l)\in T_{t,d}(o_2,s,b)} r(s,b,l)} \qquad (11.12)$$

and p_{t,d}(o_2, s, b) is defined analogously in (11.13).
After enumerating all the routers and communication pairs, the port probability values are obtained and stored in the routing tables offline.
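A compact sketch of this offline conversion is given below: for each flow that passes router t through input direction d, it aggregates the allocation ratios of the paths leaving through each output port and normalizes them over the candidate ports, in the spirit of Eqs. 11.12 and 11.13. The path representation, the direction helper, and the coordinate convention (y growing southward) are illustrative assumptions, and injection at the local port is not handled.

```python
from collections import defaultdict

def port_probabilities(paths, t, d):
    """paths: list of (src, dst, ratio, hops), hops being the tile sequence.
    Returns {(src, dst): {port: probability}} for routing table RT(t, d)."""
    agg = defaultdict(lambda: defaultdict(float))
    for src, dst, ratio, hops in paths:
        for i in range(1, len(hops) - 1):
            # path enters tile t from direction d and leaves toward hops[i+1]
            if hops[i] == t and direction(hops[i - 1], t) == d:
                agg[(src, dst)][direction(t, hops[i + 1])] += ratio
    return {flow: {o: w / sum(ports.values()) for o, w in ports.items()}
            for flow, ports in agg.items()}

def direction(a, b):
    """Mesh port from tile a to neighboring tile b, tiles as (x, y) pairs."""
    (ax, ay), (bx, by) = a, b
    return {(1, 0): 'E', (-1, 0): 'W', (0, 1): 'S', (0, -1): 'N'}[(bx - ax, by - ay)]
```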
In Eqs. 11.12 and 11.13, in order to maintain the global path traffic allocation ratios, each entry in the routing table RT(t, d) needs to distinguish the source tile location s of the packet, which increases the number of entries in the routing table. We can optimize the routing table size by grouping the entries of different sources with the same destination together. The new format of the routing table becomes RT(t, d) = {(b, o, p(o)) | o ∈ D_out(t), 0 ≤ p(o) ≤ 1}. In this case, the port probability value p(o, b) in RT(t, d) is calculated analogously, except that the aggregate path traffic is first summed over all sources s, weighted by their injection rates, before being normalized over the candidate ports (Eqs. 11.14 and 11.15).
Although the exact path traffic allocation ratios cannot be kept under Eqs. 11.14 and 11.15, the traffic distribution on each router and each link can still be maintained, which is what matters most for making the energy distribution more uniform. The simulation results presented in Sect. 11.6 show that this routing table implementation achieves similar improvements in peak energy reduction and latency performance compared to the routing table using source-destination pairs.
The block diagram of the NoC router designed to support thermal-aware routing is illustrated in Fig. 11.7a. For minimal path routing, if the input direction and the destination are fixed, there are at most two candidate output ports, o_1 and o_2, with p(o_1) + p(o_2) = 1. The routing selection unit in the router selects the output port O_dir by comparing the probability value p(o_1) with a random number τ in [0, 1]: it selects port o_1 if τ ≤ p(o_1), otherwise port o_2 is chosen. A pseudo-random number generator based on a linear feedback shift register (LFSR) is employed to generate the random number at run time.
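The sketch below models this selection logic in Python: a 16-bit Fibonacci LFSR supplies the pseudo-random value that is compared against the stored ratio, mirroring the comparator in Fig. 11.7c. The tap positions, register width, and fixed-point scaling of p(o_1) are illustrative assumptions, not details taken from the chapter.

```python
class LFSR16:
    """16-bit Fibonacci LFSR with a maximal-length tap set (16, 14, 13, 11)."""
    def __init__(self, seed=0xACE1):
        self.state = seed & 0xFFFF

    def next(self) -> int:
        bit = ((self.state >> 0) ^ (self.state >> 2) ^
               (self.state >> 3) ^ (self.state >> 5)) & 1
        self.state = (self.state >> 1) | (bit << 15)
        return self.state              # pseudo-random value in [0, 2^16)

def choose_port(p_o1_fixed: int, lfsr: LFSR16) -> int:
    """Return 1 if port o1 is chosen, else 2; p(o1) is scaled to 16 bits."""
    tau = lfsr.next()
    return 1 if tau <= p_o1_fixed else 2
```

In hardware the same comparison is a single >= comparator fed by the LFSR and the ratio field of the routing table entry, as in Fig. 11.7c.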
The header flit at each input port is first decoded by a parser (the HPU module shown in Fig. 11.7b) to extract its destination. Then, by querying RT(t, d), two candidate output ports are returned together with their corresponding probability values.
[Fig. 11.7 shows the router microarchitecture and its routing computation pipeline: per-input header parser units (HPU), per-input routing tables, and the ratio-based output port selection unit built from an LFSR, a comparator, and bypass/backpressure multiplexers.]
Fig. 11.7 Block diagram of the router supporting ratio-based routing: (a) router microarchitecture, (b) routing computation unit, (c) ratio-based output port selection
The output port selection unit then makes the final decision. In addition to the probabilistic selection, backpressure information (Bp_1 and Bp_2 in Fig. 11.7c) from the downstream routers is also taken into consideration: if a candidate output port is unavailable due to limited buffer space, the backpressure signal disables that port in the selection.
[Fig. 11.8: routing adaptivity of the proposed path sets versus turn-model-based algorithms across the multimedia benchmarks.]
First, we compare the routing adaptivity of the admissible path sets generated by the proposed cycle-removal algorithm with that of other turn-model-based routing algorithms under several real benchmarks. The results are shown in Fig. 11.8. It can be seen that 20–30 % more paths are available if we take the application traffic into consideration, which indicates a better opportunity to use more paths to even out the NoC power profile.
[Figure: normalized peak energy of west-first, north-last, negative-first, odd-even, XY, and the proposed opt_source_dest and opt_dest routing schemes across the multimedia benchmarks PIP, MWD, MPEG4, VOPD, MMS_1, MMS_2, and DVOPD.]
[Fig. 11.10 plots average delay (cycles) and normalized peak energy versus packet injection rate (packets/cycle/node) for west-first, north-last, negative-first, odd-even, XY, the proposed opt_source_dest and opt_dest routing, and the opt_peak bound computed by cvx.]
Fig. 11.10 Latency and peak energy simulation for (a) random traffic, (b) Transpose-1 traffic, and (c) Transpose-2 traffic
The total energy consumption under all the routing schemes is the same, i.e., 4.55 × 10⁻⁵ J. However, the peak tile energies of the proposed, odd-even, negative-first, and XY routing are 5.95 × 10⁻⁶ J, 8.09 × 10⁻⁶ J, 7.53 × 10⁻⁶ J, and 7.72 × 10⁻⁶ J, respectively. From the figures, we can see that our proposed scheme indeed leads to a more even energy profile across the NoC chip.
[Fig. 11.11 plots average delay and normalized peak energy versus packet injection rate for the same set of routing schemes.]
Fig. 11.11 Latency and peak energy simulation for (a) hotspot in right side (Hotspot-Rs) traffic, (b) hotspot in top-right corner (Hotspot-Tr) traffic, and (c) bursty traffic
In Table 11.3, we summarize the execution time of our off-line routing algorithm (including path generation and LP solving) under various mesh sizes and communication densities ρ (ρ is defined as the ratio of the total number of communication pairs to the number of processors in the mesh). It can be seen that the execution time is reasonable.
[Fig. 11.12 plots average delay and normalized peak energy versus packet injection rate for the same set of routing schemes.]
Fig. 11.12 Latency and peak energy simulation for (a) hotspot in center (Hotspot-C) traffic and (b) butterfly traffic
For larger mesh sizes (7 × 7 or more) and more communication pairs (100 or more), the number of minimal paths increases dramatically, and it takes longer (2–3 h on average) to obtain the traffic allocation ratios. Since the traffic allocation is determined offline, the time cost is still affordable in most cases. If we want to reduce the execution time, we can restrict the number of minimal-length paths to a smaller subset.
The energy consumption of a tile consists of that of the processor and the router. For different applications, the power contributions of the router and the processor differ greatly. Our thermal-aware routing algorithm can only re-distribute the router power consumption. Under this constraint, we next evaluate the effectiveness of our algorithm on peak energy reduction when the ratio of the energy contributions from the router and the processor varies.
[Fig. 11.13 shows four 4 × 4 surface plots of per-tile peak energy, on the order of 10⁻⁶ J, one for each routing scheme.]
Fig. 11.13 An example of the NoC energy profile under hotspot-4c traffic: (a) proposed routing, (b) odd-even routing, (c) negative-first routing, (d) XY routing
We define r_e = (average processor energy)/(average router energy) to reflect this ratio. The experimental results shown in the previous sections assume r_e = 1. In this sub-section, we simulate the peak energy reduction of our routing scheme for different r_e values. Tables 11.4–11.6 summarize the results for synthetic traffic and real benchmarks, respectively. From the results we can see that as r_e increases, the peak energy reduction becomes smaller, because the relative contribution of the router energy decreases. Overall, we achieve an average 6–17 % peak energy reduction over the XY and odd-even routing schemes when r_e ranges from 0.67 to 4 across all the benchmarks.
Table 11.4 Peak energy reduction (%) under various re (for Random, Transpose-1, Transpose-2 and Hotspot-C traffic)

                           Random           Transpose-1      Transpose-2      Hotspot-C
Average energy ratio (re)  vs. XY  vs. OE   vs. XY  vs. OE   vs. XY  vs. OE   vs. XY  vs. OE
0.67                       17.4    12.8     15.6    17.9     9.5     12.1     15.7    17.6
1.00                       15.7    15.3     13.3    14.2     10.3    10.3     14.4    16.2
1.33                       13.3    9.5      12.5    14.5     7.2     9.4      13.2    14.9
1.67                       10.6    8.6      11.2    13.8     6.4     9.4      12.3    14.0
2.00                       11.9    9.3      11.1    13.8     6.5     9.1      11.6    15.0
2.33                       10.5    8.2      8.5     10.7     5.5     7.0      10.8    13.9
2.67                       9.4     7.7      8.9     11.1     4.3     6.9      10.2    13.0
3.00                       8.8     7.0      8.8     10.6     3.7     5.6      9.6     10.9
3.33                       7.9     5.9      7.1     9.9      4.2     6.6      9.1     10.5
3.67                       8.1     6.5      7.8     9.2      4.1     5.8      8.7     9.9
4.00                       7.9     6.1      7.0     9.0      4.6     6.4      8.3     9.4
Average                    11.04   8.81     10.16   12.25    6.03    8.05     11.26   13.21
Table 11.5 Peak energy reduction (%) under various re (for Butterfly, Hotspot-Tr, Hotspot-Rs and Bursty traffic)

                           Butterfly        Hotspot-Tr       Hotspot-Rs       Bursty
Average energy ratio (re)  vs. XY  vs. OE   vs. XY  vs. OE   vs. XY  vs. OE   vs. XY  vs. OE
0.67                       14.7    20.2     16.6    9.9      21.4    11.2     12.6    16.1
1.00                       13.1    19.1     15.4    9.2      19.9    10.5     10.4    13.0
1.33                       10.8    15.2     13.6    7.9      18.5    9.7      8.1     11.3
1.67                       9.8     14.4     12.5    7.3      17.2    8.9      7.5     9.9
2.00                       9.9     13.7     11.6    6.7      16.3    9.4      5.6     9.1
2.33                       8.9     13.0     11.0    6.5      15.5    8.2      7.1     8.7
2.67                       8.8     12.8     10.1    5.8      14.6    7.8      5.3     7.9
3.00                       8.7     11.2     9.5     5.4      13.8    7.2      5.5     7.6
3.33                       7.4     9.8      8.9     5.1      13.1    6.9      4.9     7.2
3.67                       5.8     9.4      8.4     4.8      12.5    6.6      4.3     6.0
4.00                       6.0     8.8      8.0     4.6      12.0    6.3      4.5     6.4
Table 11.6 Peak energy reduction (%) under various re (for real benchmark traffic)

                           MMS-1            MPEG4            VOPD
Average energy ratio (re)  vs. XY  vs. OE   vs. XY  vs. OE   vs. XY  vs. OE
0.67                       17.7    11.8     7.8     10.2     16.5    28.9
1.00                       15.7    10.4     6.9     9.0      12.3    23.6
1.33                       14.5    9.8      6.1     8.0      10.3    20.5
1.67                       12.5    8.1      5.5     7.3      9.9     19.0
2.00                       10.6    6.5      5.4     7.1      9.3     17.7
2.33                       9.3     5.6      5.0     6.5      8.3     16.1
2.67                       10.4    7.0      4.6     6.0      7.6     14.8
3.00                       9.7     6.5      4.3     5.6      7.6     14.2
3.33                       8.5     5.6      4.0     5.5      7.7     13.8
3.67                       7.9     5.2      4.2     5.4      7.1     13.0
4.00                       7.3     4.7      3.9     5.0      6.5     12.0
Average                    11.28   7.38     5.25    6.87     9.37    17.6
11.7 Conclusions
NoC has been widely adopted to handle the complicated communication requirements of future MPSoCs. As temperature becomes a key constraint in NoCs, in this chapter we proposed an application-specific, thermal-aware routing algorithm that distributes the traffic more uniformly across the chip. A deadlock-free path-set finding algorithm is first utilized to maximize the routing adaptivity. A linear programming (LP) problem is then formulated to allocate the traffic properly among the paths. A table-based router is also designed to select output ports according to the traffic allocation ratios. The simulation results show that the peak energy reduction can be as high as 20 % for both synthetic traffic and real benchmarks.
References
12.1 Introduction
12.1.1 Background
With the advances of semiconductor technology, more and more components need to be integrated within a chip. For high-performance chip multi-processors (CMPs) or multi-processor SoCs (MPSoCs), tens or even hundreds of processor cores, memories, and other components have to be connected. On-chip interconnection is gradually becoming a major challenge for system-level optimization of performance, cost, and power consumption [1].
Figure 12.1 shows the architectural trend of on-chip interconnection. The traditional point-to-point interconnection suffers from the high complexity of wire routing, which leads to large layout area and long transmission delay. The shared-bus architecture suffers from its limited bandwidth. The low scalability of these two traditional paradigms makes them insufficient to accommodate the communication requirements with predictable performance and design effort. By viewing the on-chip interconnection as a micro-network, Network-on-Chip (NoC) has emerged as a novel and practical solution [2]. Many topologies, such as ring, two-dimensional mesh (2D mesh), two-dimensional torus (2D torus), star, and octagon, have been discussed in the literature for NoC. Among these topologies, mesh-based ones are very popular in both research and commercial fields. Due to its regularity, the 2D mesh has small layout overhead and good electrical properties. Therefore, mesh-based topologies are popular for homogeneous multi-/many-core systems [3–5].
Multicore systems will sustain performance improvement only until the interconnection becomes the bottleneck [6]. This limitation was addressed in [1]: in order for chips to remain fully wirable, the number of wire levels must keep growing. Figure 12.2 provides a projection of the number of wire levels in an SoC system [1]. Obviously, if Moore's Law is followed, an improbable 90 metal levels would be required, which is an impractical solution. Recently, the emerging Through Silicon Via (TSV)-based die-stacking three-dimensional (3D) IC process has provided a novel connection approach that reduces the wire delay between dies [7]. By leveraging 3D IC technology, three-dimensional Network-on-Chip (3D NoC) has the following three advantages over traditional 2D NoC, allowing higher performance with less power and a smaller form factor for on-chip data transfer [8]:
Fig. 12.2 Projection of wire levels as technology scales down (©2001 IEEE. Reprinted, with permission, from [1])
Fig. 12.4 Thermal impact of processor cores and on-chip network (©2004 IEEE/ACM. Reprinted,
with permission, from [9])
Fig. 12.5 (a) Factors of the 3D thermal problem; (b) the temperature distribution is higher and wider, creating more hotspots (©IEEE. Reprinted, with permission, from [16])
In conventional 2D ICs, the power density increases as the technology scales down, which has made thermal issues a major design factor [6]. The increasing power density raises the heat generation rate per unit chip area, leading to high temperature. High temperature results in slower circuit switching, larger leakage power, and higher vulnerability to thermal runaway. In traditional 2D NoC systems, routers have been shown to have a thermal impact comparable to processors, contributing significant thermal overhead to the overall chip [9–11], as shown in Fig. 12.4. The reason is that the power density of the router is similar to, or even higher than, the average power density of the processor [11], owing to the high switching activity in the routers. Therefore, the NoC router is one of the sources of thermal hotspots [11, 12].
For a k-tier 3D IC platform, the power density will be k times higher than in a traditional 2D IC with the same footprint and process technology. Therefore, the thermal problem is more severe and has been viewed as a major issue of 3D ICs. Because of the stacking structure of 3D ICs, the factors that worsen the thermal problem of a 3D NoC include: (i) the high switching activity of each router, (ii) the longer heat conduction path, (iii) the larger cross-sectional power density, and (iv) the varying cooling efficiency, as shown in Fig. 12.5a. The mean temperature of the NoC is pushed higher by the larger power density and the longer average heat conduction path. The temperature variance is increased by the varying cooling efficiency of the different layers. Figure 12.5b shows that a 3D NoC inevitably tends to have more routers whose temperature exceeds the thermal limit, because of the higher mean and larger variance of the temperature distribution. Such critically high temperatures put the chip into an unsafe operating state.
For thermal safety, the system generally requires a better cooling solution. However, the cost of a heat sink grows exponentially with its cooling capability. Consequently, it is infeasible to eliminate hotspots solely by enhancing the cooling device. In this chapter, we investigate thermal-aware routing algorithms that keep a 3D NoC system thermally safe without enhanced cooling devices.
The traditional design issues of traffic- and thermal-aware routing can be categorized into two types, (i) off-line and (ii) on-line, which target two different kinds of systems. The key difference between them is whether the transient temperature profile of the system is obtainable during operation. We address both types of design problems in this section.
Fig. 12.8 (a) The global throttling (GT) scheme throttles the entire network; (b) the distributive throttling (DT) scheme only throttles the overheated node; (c) thermal-aware vertical throttling (TAVT) determines the throttling state based on the level of thermal emergency
The simplest RTM uses the global throttling (GT) scheme to cool down the network [13]. When any node is near overheating (i.e., its temperature exceeds T_T), the entire network is slowed down, as shown in Fig. 12.8a. Although GT has a short throttling time, its impact on availability is huge. To reduce the performance impact of the GT scheme, a distributive and collaborative throttling scheme, ThermalHerd, was proposed [9]. The distributed traffic throttling (DT) controls the quota of incoming traffic of a node while its temperature exceeds T_T, as shown in Fig. 12.8b. For 3D NoC systems, the heterogeneous thermal conductance may result in long cooling times under the DT scheme for the nodes that are far away from the heat sink. To provide an effective heat conduction path within a short response time, the thermal-aware vertical throttling (TAVT) scheme was proposed in [15]. Different from the DT scheme, the granularity of thermal control is a pillar, which consists of the nodes with the same XY address as the near-overheated node. Based on the level of thermal emergency, TAVT selects different throttling states (i.e., the number of throttled nodes in a pillar), as shown in Fig. 12.8c. However, GT, DT, and TAVT all result in topology changes. Traditional routing algorithms then cause packets to be blocked in the network, which leads to severe traffic congestion and a large performance impact. In Sect. 12.4, some on-line thermal-aware routing algorithms will be introduced to reduce the performance impact when the RTM is triggered.
Figure 12.6 shows the two extreme design schemes, the Load Balanced Design (LBD) scheme and the Traffic Balanced Design (TBD) scheme. The problem of TBD is its consequent requirement of unbalanced vertical traffic loading. On the other hand, the LBD scheme results in an unbalanced temperature profile, although it leads to the most balanced vertical traffic loading. Due to the heterogeneous thermal conductance of the silicon layers of a 3D NoC system, these two factors limit the performance gain of 3D stacking [14]. In this section, we present an effective, controllable, and systematic approach to remedy the shortcomings of the LBD and TBD schemes.
Heat conduction in a 3D NoC system can be modeled using Fourier's heat flow analysis, which has been the academic and industrial standard method for circuit-level, architecture-level, and chip-package thermal analysis for decades. The method is analogous to Ohm's law for modeling electrical circuits: heat flow is analogous to electrical current, and temperature is analogous to voltage. Each element of the temperature profile is determined by power, thermal conductance, and thermal capacitance. Therefore, the thermal model of a single tile, which contains one router, one processor, and one memory, can be constructed as in Fig. 12.9a. The temperatures of the router, memory, and processor at node (x, y, z) are T^R_{x,y,z}, T^M_{x,y,z}, and T^P_{x,y,z}; the corresponding powers are P^R_{x,y,z}, P^M_{x,y,z}, and P^P_{x,y,z}. As shown in Fig. 12.9b, the X by Y by Z 3D NoC is composed of identical single tiles. The notation used in the analysis of this section is summarized in Table 12.1.
Fig. 12.9 (a) Thermal model of a single tile, and (b) the abstracted model of an X by Y by Z 3D NoC system (©IEEE. Reprinted, with permission, from [16])
For a 3D NoC system, the temperature profile T^R and the power profile P^R of the routers can be represented as follows:

$$\mathbf{T}^R = \begin{bmatrix} T^R_{1,1} & \cdots & T^R_{X,1}\\ \vdots & \ddots & \vdots\\ T^R_{1,Y} & \cdots & T^R_{X,Y} \end{bmatrix}, \quad\text{and}\quad \mathbf{T}^R_{x,y} = \begin{bmatrix} T^R_{x,y,1}\\ \vdots\\ T^R_{x,y,Z} \end{bmatrix} \qquad (12.1)$$

$$\mathbf{P}^R = \begin{bmatrix} P^R_{1,1} & \cdots & P^R_{X,1}\\ \vdots & \ddots & \vdots\\ P^R_{1,Y} & \cdots & P^R_{X,Y} \end{bmatrix}, \quad\text{and}\quad \mathbf{P}^R_{x,y} = \begin{bmatrix} P^R_{x,y,1}\\ \vdots\\ P^R_{x,y,Z} \end{bmatrix} \qquad (12.2)$$

T^R_{x,y} is the 1D vertical temperature profile of the routers with the same (x, y) address, and T^R is the entire 3D temperature profile of the routers. P^R_{x,y} is the 1D vertical power profile of the routers with the same (x, y) address, and P^R is the entire 3D power profile of the routers. Similarly, the temperature and power profiles of the memory part are represented by T^M and P^M, and those of the processor part by T^P and P^P.
The design goal of the thermal-aware routing is to maximize the network throughput while keeping the temperature below the thermal limit. By redistributing the traffic load L, the power profile P^R of an overheated case T^R can be reduced, while the power profiles of the memories and the processors (i.e., P^M and P^P) are not changed. L is defined as:

$$\mathbf{L} = \begin{bmatrix} L_{1,1} & \cdots & L_{X,1}\\ \vdots & \ddots & \vdots\\ L_{1,Y} & \cdots & L_{X,Y} \end{bmatrix}, \quad\text{and}\quad \mathbf{L}_{x,y} = \begin{bmatrix} L_{x,y,1}\\ \vdots\\ L_{x,y,Z} \end{bmatrix}. \qquad (12.3)$$
In the LBD scheme, the traffic load is balanced so that every channel carries the same loading L^{LBD}:

$$\mathbf{L}_{x,y}\big|_{\forall x,y} = \begin{bmatrix} L^{LBD}\\ \vdots\\ L^{LBD} \end{bmatrix}. \qquad (12.4)$$

The channel bandwidth L^{BW} is an upper bound that limits the transfer rate, so the channel loading is bounded as:

$$L^{LBD} \le L^{BW} = L^{BW\text{-}LBD}, \qquad (12.5)$$
where "BW-LBD" denotes the bandwidth bound of LBD. Because the channel loading determines the switching activity of each router, we assume that the power of a router is an increasing function f of the channel loading (i.e., P^{LBD} = f_{L→P}(L^{LBD})). As shown in (12.6), P^R is balanced when L is balanced:

$$\mathbf{P}^R_{x,y}\big|_{\forall x,y} = \begin{bmatrix} P^R_{x,y,1}\\ \vdots\\ P^R_{x,y,Z} \end{bmatrix} = \begin{bmatrix} f_{L\to P}(L^{LBD})\\ \vdots\\ f_{L\to P}(L^{LBD}) \end{bmatrix} = \begin{bmatrix} P^{LBD}\\ \vdots\\ P^{LBD} \end{bmatrix}. \qquad (12.6)$$
According to Fourier's heat conduction law, most of the heat in a 3D stacked chip is dissipated in the vertical direction [16]. Therefore, the vertical temperature profile T^R_{x,y} at each (x, y) is mainly determined by the vertical power profile P^R_{x,y}. Under LBD, the vertical temperature profile exhibits the following thermal differences:
$$\mathbf{T}^R_{x,y} = \begin{bmatrix} T^R_{x,y,1}\\ \vdots\\ T^R_{x,y,Z-1}\\ T^R_{x,y,Z} \end{bmatrix} = \begin{bmatrix} T^R_{x,y,2} + P^{LBD}\cdot g^{-1}_{x,y,1}\\ \vdots\\ T^R_{x,y,Z} + P^{LBD}\cdot g^{-1}_{x,y,Z-1}\\ T^A + P^{LBD}\cdot g^{-1}_{x,y,Z} \end{bmatrix}. \qquad (12.7)$$

$$T^A \le T^R_{x,y,Z} \le T^R_{x,y,Z-1} \le \cdots \le T^R_{x,y,2} \le T^R_{x,y,1} \le T^L. \qquad (12.8)$$
Since we must keep the temperatures of all routers at or below the thermal limit, the channel loading L^{LBD} is bounded. By combining (12.7) and (12.8), we can derive the thermal-limited bound of the channel loading under LBD as:

$$L^{LBD} = f^{-1}_{L\to P}(P^{LBD}) \le f^{-1}_{L\to P}\Big((T^L - T^A)\Big/\sum_{z=1}^{Z} g^{-1}_{x,y,z}\Big) = L^{TL\text{-}LBD}, \qquad (12.9)$$
where "TL-LBD" denotes the thermal-limit bound of LBD. The optimality criterion of the LBD scheme is that the bandwidth bound L^{BW-LBD} on the channel loading is tighter than the thermal-limited bound L^{TL-LBD}. Because T^L is not extremely high in most cases, L^{TL-LBD} is usually smaller than L^{BW-LBD}. Therefore, the LBD scheme cannot guarantee optimal network throughput.
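The following numeric sketch evaluates Eqs. 12.7–12.9 for one (x, y) pillar under a linear power model P = f(L) = aL + c. All constants (conductances, temperatures, model coefficients, and the bandwidth bound) are made-up placeholders used only to show how the binding bound is determined.

```python
T_A, T_L = 45.0, 110.0          # ambient and thermal-limit temperatures (C)
g = [4.0, 3.0, 2.0, 1.5]        # thermal conductances g_{x,y,z}, z = 1..Z (W/K)
a, c = 2.0, 0.1                 # hypothetical linear power model P = a*L + c

# Eq. 12.9: the hottest router is the top one (z = 1); its temperature rise
# is the sum of all per-layer rises, so P^LBD <= (T^L - T^A) / sum(1/g_z).
p_lbd_max = (T_L - T_A) / sum(1.0 / gz for gz in g)
L_TL_LBD = (p_lbd_max - c) / a  # thermal-limited loading bound f^{-1}(P^LBD)

L_BW = 0.2                      # hypothetical bandwidth bound (flits/cycle)
print(f"L_TL-LBD = {L_TL_LBD:.3f}, binding bound = {min(L_TL_LBD, L_BW):.3f}")
```

With these placeholder values one can see directly which of the two bounds binds; LBD is optimal exactly when the printed binding bound is the bandwidth bound.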
The goal of the TBD scheme is to balance the temperature profile in order to eliminate hotspots. Assume there is only one heat sink at the bottom of the chip, as shown in Fig. 12.8. The optimum result of the temperature balancing scheme is that every T^R_{x,y,z} in the temperature profile equals a single balanced temperature T^{TBD}, at best the thermal limit T^L:

$$\mathbf{T}^R_{x,y}\big|_{\forall x,y} = \begin{bmatrix} T^R_{x,y,1}\\ \vdots\\ T^R_{x,y,Z} \end{bmatrix} = \begin{bmatrix} T^{TBD}\\ \vdots\\ T^{TBD} \end{bmatrix} \quad\text{(TBD scheme)}. \qquad (12.10)$$
Usually, T^{TBD} is less than T^L. Following (12.10), the heat transfer toward the ambient is maximized when the temperature difference between the bottom layer and the ambient is T^L − T^A and the temperature difference between vertically neighboring routers is zero. Such a result leads to an extremely unbalanced vertical power profile for each P^R_{x,y}, as follows:
$$\mathbf{P}^R_{x,y} = \begin{bmatrix} P^R_{x,y,1}\\ \vdots\\ P^R_{x,y,Z-1}\\ P^R_{x,y,Z} \end{bmatrix} = \begin{bmatrix} (T^R_{x,y,1} - T^R_{x,y,2})\cdot g_{x,y,1}\\ \vdots\\ (T^R_{x,y,Z-1} - T^R_{x,y,Z})\cdot g_{x,y,Z-1}\\ (T^R_{x,y,Z} - T^A)\cdot g_{x,y,Z} \end{bmatrix} = \begin{bmatrix} 0\cdot g_{x,y,1}\\ \vdots\\ 0\cdot g_{x,y,Z-1}\\ (T^L - T^A)\cdot g_{x,y,Z} \end{bmatrix} = \begin{bmatrix} 0\\ \vdots\\ 0\\ P^{TBD} \end{bmatrix} \qquad (12.11)$$
To fit such a power profile, all the traffic must be concentrated in the bottom silicon layer, leading to huge traffic congestion there, while the channel bandwidth of the routers in the upper Z − 1 silicon layers is left unutilized:

$$\mathbf{L}_{x,y} = \begin{bmatrix} L_{x,y,1}\\ \vdots\\ L_{x,y,Z-1}\\ L_{x,y,Z} \end{bmatrix} = \begin{bmatrix} f^{-1}_{L\to P}(P^R_{x,y,1})\\ \vdots\\ f^{-1}_{L\to P}(P^R_{x,y,Z-1})\\ f^{-1}_{L\to P}(P^R_{x,y,Z}) \end{bmatrix} = \begin{bmatrix} f^{-1}_{L\to P}(0)\\ \vdots\\ f^{-1}_{L\to P}(0)\\ f^{-1}_{L\to P}(P^{TBD}) \end{bmatrix} = \begin{bmatrix} 0\\ \vdots\\ 0\\ L^{TBD} \end{bmatrix}. \qquad (12.12)$$
The loading L^{TBD} of the bottom-layer channels defines the thermal-limited bound of TBD,

$$L^{TBD} = f^{-1}_{L\to P}(P^{TBD}) = f^{-1}_{L\to P}\big((T^L - T^A)\cdot g_{x,y,Z}\big) = L^{TL\text{-}TBD}, \qquad (12.13)$$

and it is also bounded by the channel bandwidth:

$$L^{TBD} \le L^{BW\text{-}TBD}. \qquad (12.14)$$
The optimality criterion of the TBD scheme is that the thermal-limited bound L^{TL-TBD} on the channel loading is tighter than the bandwidth bound L^{BW-TBD}. Observing (12.13), if T^L is not very low, L^{TL-TBD} is relaxed, so L^{TL-TBD} may become larger than L^{BW-TBD}. Therefore, the TBD scheme cannot guarantee optimal throughput either.
To determine whether LBD or TBD can achieve maximal throughput for a 3D NoC, we have to compare the bandwidth bound and the thermal-limited bound of LBD and TBD. Table 12.2 shows the four cases. In Case 1, both criteria are satisfied, so LBD and TBD are both optimal. However, by comparing (12.9) and (12.13), Case 1 only happens when Z = 1, which is a 2D NoC case with a very high thermal limit. Case 2 is the situation where the thermal limit is very low, making LBD non-optimal and TBD optimal. Case 3 is the situation where the thermal limit is very high, making LBD optimal and TBD non-optimal. Case 4 is the situation where a middle thermal limit is given; in this case, neither LBD nor TBD can guarantee maximal throughput. Therefore, we need a new design scheme to remedy the shortcomings of the LBD and TBD schemes, which is introduced next.
Fig. 12.10 Desired traffic loading for (a) high thermal limit cases, adopting LBD scheme;
(b) middle thermal limit cases, adopting a mixture of LBD and TBD scheme; (c) low thermal
limit cases, adopting TBD scheme (©IEEE. Reprinted, with permission, from [16])
Fig. 12.11 (a) In traditional dimension-ordered routing, the lateral routing layer is identical to the
source layer. (b) With traffic migration D = 1, the lateral routing layer is one layer below the source
layer. (c) The lateral routing is an adaptive routing (©IEEE. Reprinted, with permission, from [16])
If the thermal limit is low, a TBD-like distribution is preferred, and the target vertical distribution is (12.12), as shown in Fig. 12.10c. If the thermal limit is in the middle, both temperature and channel loading may limit the throughput; a vertical traffic distribution between (12.4) and (12.12) is then preferred to achieve maximum throughput, as shown in Fig. 12.10b.
To realize the three traffic loadings shown in Fig. 12.10, the VDLAPR algorithm was proposed in [16]. Figure 12.11 shows its routing scheme of traffic migration. Traditional dimension-ordered routing algorithms, such as the XYZ routing in Fig. 12.11a, cannot accommodate all the vertical traffic requirements of the different thermal limits shown in Fig. 12.10. We relax the minimal-routing constraint in the vertical direction and define the downward level D, a scalar control parameter representing the vertical misroute distance, to redistribute the traffic gradually. Assume the source address of a packet is (X_S, Y_S, Z_S) and the destination address is (X_D, Y_D, Z_D). With a given downward level, the address of the target lateral routing layer of the packet, defined as Z^{TLRL}, is found by:

$$Z^{TLRL} = Z_S + D. \qquad (12.15)$$
Because the routing layer must lie within the 3D mesh, the actual lateral routing layer Z^{LRL} is bounded to [1, Z]:

$$Z^{LRL} = \begin{cases} 1, & \text{if } Z^{TLRL} < 1\\ Z, & \text{if } Z^{TLRL} > Z\\ Z^{TLRL}, & \text{otherwise} \end{cases} \qquad (12.16)$$
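As a direct transcription of Eqs. 12.15 and 12.16, the helper below computes the lateral routing layer for a packet; the function name and the 1-based layer indexing are illustrative choices.

```python
def lateral_routing_layer(z_src: int, D: int, Z: int) -> int:
    """Return Z^LRL for a packet injected at layer z_src (layers 1..Z)."""
    z_tlrl = z_src + D             # Eq. 12.15: target lateral routing layer
    return max(1, min(Z, z_tlrl))  # Eq. 12.16: clamp the layer to [1, Z]

# e.g. lateral_routing_layer(1, 2, 4) -> 3; lateral_routing_layer(3, 2, 4) -> 4
```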
Fig. 12.12 Turn limitation in 3D space (©IEEE. Reprinted, with permission, from [16])
VDLAPR relies on a turn-model-based approach to prevent deadlock. As shown in Fig. 12.12, deadlock freedom of the turns in the XY-plane is achieved through the Odd-Even turn model. The circular waiting condition can never hold in the XZ-plane or YZ-plane, since we remove the Up-then-East (UE), Up-then-West (UW), Up-then-North (UN), and Up-then-South (US) turns. Therefore, the routing algorithm is guaranteed to be deadlock-free.
To support the routing-based traffic migration method, we extend the queuing model proposed in [18], which adopts M/M/1/K queues. Several assumptions are made to simplify the derivation: Poisson packet arrivals and exponential service times are required for each channel. Besides, each packet is viewed as an atomic data unit, and the network is assumed to be near full load (i.e., the throughput of each router is near its maximum). We use the north input channel of the router at (x, y, z) as an example. Assuming the network is not overloaded in steady state, the main task of buffer allocation is to calculate the full probability of each channel. For channel C_{x,y,z,N}, the full probability b_{x,y,z,N} is:
$$b_{x,y,z,N} = \frac{1-\rho_{x,y,z,N}}{1-\rho_{x,y,z,N}^{\,1+k_{x,y,z,N}}}\times \rho_{x,y,z,N}^{\,k_{x,y,z,N}}, \quad\text{where } \rho_{x,y,z,N} = \frac{\lambda_{x,y,z,N}}{\mu_{x,y,z,N}}. \qquad (12.17)$$
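Equation 12.17 is the standard blocking probability of an M/M/1/K queue with k buffer slots. The short Python sketch below evaluates it, handling the ρ → 1 limit separately; the example numbers are made up.

```python
def full_probability(lam: float, mu: float, k: int) -> float:
    """Probability that a k-slot M/M/1/K channel queue is full (Eq. 12.17)."""
    rho = lam / mu                    # channel utilization
    if abs(rho - 1.0) < 1e-12:        # limit of the formula as rho -> 1
        return 1.0 / (1 + k)
    return (1 - rho) / (1 - rho ** (1 + k)) * rho ** k

# e.g. a 4-flit queue at 80% utilization is full about 12% of the time
print(full_probability(0.8, 1.0, 4))
```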
The full probability depends on two parameters, the arrival rate λ_{x,y,z,N} and the service rate μ_{x,y,z,N}, as shown by (12.17). However, λ_{x,y,z,N} becomes unpredictable if the path selection depends on the channel status. Besides, the μ_{x,y,z,N} of one router depends on the full probabilities of all its downstream channels, because of the interconnected structure of an NoC system. Consider that some of the downstream packets are delivered toward the east, i.e., to C_{x+1,y,z,W}. The full probability b_{x+1,y,z,W} will affect the effective service rate of C_{x,y,z,N}. The effective service rate of C_{x+1,y,z,W} can be approximated by 1/b_{x+1,y,z,W}. When C_{x+1,y,z,W} is full, the reciprocal of the rate at which the queue occupation decreases equals the average waiting time for entering the queue. Hence, the average waiting time for entering the queue of C_{x+1,y,z,W} can be approximated by:
Cx+1,y,z,W can be approximated by:
−1
Wx+1,y,z,W = 1
bx+1,y,z,W − λx+1,y,z,W . (12.18)
At steady state, by applying Little's formula, the average waiting time for entering the queue of C_{x+1,y,z,W} can also be written as:

$$W_{x+1,y,z,W} = \big(\bar{\mu}^{\,E}_{x,y,z,N} - p^{E}_{x,y,z,N}\times\lambda_{x,y,z,N}\big)^{-1}. \qquad (12.19)$$
Substituting W_{x+1,y,z,W} between (12.18) and (12.19), we get the effective service rate toward the east of C_{x,y,z,N} as:

$$\bar{\mu}^{\,E}_{x,y,z,N} = \frac{1}{b_{x+1,y,z,W}} - \lambda_{x+1,y,z,W} + p^{E}_{x,y,z,N}\times\lambda_{x,y,z,N}, \qquad (12.20)$$

and

$$\bar{\mu}_{x,y,z,N} = \sum_{\forall dir} p^{dir}_{x,y,z,N}\times\bar{\mu}^{\,dir}_{x,y,z,N}. \qquad (12.21)$$

For a single queue, the average queue length is:

$$q_{x,y,z,N} = \frac{\lambda_{x,y,z,N}}{\mu_{x,y,z,N} - \lambda_{x,y,z,N}}. \qquad (12.22)$$
By viewing the two queues as independent, the average queuing length equals the sum of the lengths of the two queues:

$$q_{x,y,z,N} = \frac{\lambda_{x,y,z,N}}{S^{-1} - \lambda_{x,y,z,N}} + \frac{\lambda_{x,y,z,N}}{\bar{\mu}_{x,y,z,N} - \lambda_{x,y,z,N}}, \qquad (12.23)$$
Table 12.3 Channel buffer depth from the results of vertical buffer allocation

                      D = 1                D = 2                D = 3
Channel               Z=0  Z=1  Z=2  Z=3   Z=0  Z=1  Z=2  Z=3   Z=0  Z=1  Z=2  Z=3
Lateral (E, S, W, N)  1    4    4    7     1    1    5    9     1    1    1    13
Up                    1    2    6    7     1    2    5    8     1    3    3    9
Down                  5    5    5    1     5    5    5    1     5    5    5    1
Then, we can compute the full probability b_{x,y,z,N} of C_{x,y,z,N}. A similar computation is applied to all the channels in the network. The marginalized service rate μ_{z,N} of the north input channels in layer z can be described as:

$$\mu_{z,N} = \lambda_{z,N} + \Big(\frac{1}{S^{-1}-\lambda_{z,N}} + \frac{1}{\bar{\mu}_{z,N}-\lambda_{z,N}}\Big)^{-1}, \qquad (12.25)$$
which depends on the marginalized arrival rate λ_{z,N} and the marginalized effective service rate μ̄_{z,N}. Here we simplify the computation of λ_{z,N} by directly averaging λ_{x,y,z,N} over each layer:

$$\lambda_{z,N} = \frac{1}{XY}\sum_{\forall x,y}\lambda_{x,y,z,N}. \qquad (12.26)$$
Similarly, μ̄_{z,N} is calculated by averaging μ̄_{x,y,z,N}. Therefore, we can use μ_{z,N} to calculate the full probability b_{z,N} of layer z as:

$$b_{z,N} = \frac{1-\rho_{z,N}}{1-\rho_{z,N}^{\,1+k_{z,N}}}\times\rho_{z,N}^{\,k_{z,N}}, \quad\text{where } \rho_{z,N} = \frac{\lambda_{z,N}}{\mu_{z,N}}. \qquad (12.27)$$
In this section, we follow the experimental setup in [16] and demonstrate design examples obtained by applying the introduced design scheme. Because the latency of a real application cannot be unbounded, the maximum average latency is set to 500 cycles when estimating the saturation throughput. The thermal limit is varied from 100 to 150 °C to cover Case 2 and Case 4 of Table 12.2, and a 200 °C case is used to demonstrate Case 3 of Table 12.2. The buffer constraint N_B is set to 16 flits for vertical buffer allocation. Uniform random traffic is used as the example here; results for other traffic patterns are shown in [16]. With a known traffic pattern and a given thermal limit, the vertical buffer allocation method produces the optimal buffer depth for each downward level, as shown in Table 12.3. The detailed algorithms for buffer allocation and for determining the proper downward level are described in [16]. The achievable throughput is then obtained by simulating each configuration with increasing injection rate.
We use Table 12.4 to show the achievable throughput comparison for uniform random traffic. First, we observe the achievable throughput of the traditional LBD scheme.
Table 12.4 Random traffic design examples: achievable throughput (flit/node/cycle)

Design case         TBD optimal        TBD and LBD both non-optimal           LBD optimal
Thermal limit       100 °C    110 °C   120 °C   130 °C   140 °C   150 °C      200 °C
LBD scheme          0.0554    0.0660   0.0767   0.0873   0.0978   0.1085      0.1695
TBD scheme          0.0642    0.0764   0.0850   0.0860   0.0860   0.0860      0.0860
Introduced scheme   0.0642    0.0764   0.0885   0.1018   0.1058   0.1173      0.1695
Downward level      D = 3     D = 3    D = 3    D = 2    D = 1    D = 1       D = 0
Improv. over LBD    15.8%     15.6%    15.5%    16.7%    8.1%     8.1%        0.0%
Improv. over TBD    0.0%      0.0%     4.2%     18.4%    23.0%    36.3%       97.1%
Its achievable throughput increases as the thermal limit becomes higher, but LBD is not the scheme that achieves maximal throughput. The reason is that the thermal-limited bound of LBD is tighter than its bandwidth bound, so the optimality criterion of LBD is not satisfied. LBD is optimal only for the 200 °C design case, in which its optimality criterion is satisfied. Second, we observe the achievable throughput of the traditional TBD scheme. Its achievable throughput increases as the thermal limit becomes higher and then saturates at 0.086 flit/node/cycle. For the 100 and 110 °C cases, TBD achieves maximal throughput because its optimality criterion is satisfied. However, as the thermal limit becomes higher, the thermal-limited bound of TBD is relaxed, the optimality criterion of TBD is no longer satisfied, and the bandwidth bound of TBD becomes the tighter one.
Compared to the LBD and TBD schemes, the introduced design scheme always achieves maximal throughput, because it covers both LBD and TBD. For the cases with 100 and 110 °C thermal limits, the proposed scheme is identical to TBD; for the case with a 200 °C thermal limit, it is identical to LBD. Moreover, for the cases in which the optimality criteria of LBD and TBD are both unsatisfied, the proposed scheme can reach the optimal configuration in the design space spanned between LBD and TBD. For these TBD-non-optimal and LBD-non-optimal cases, the introduced design scheme improves the achievable throughput by 4.2 to 36.3%. The design examples also show that although VDLAPR increases the latency in the vertical direction, the thermal-limited bound is a more serious factor limiting performance.
Fig. 12.13 The topology transforms in online operation because of the RTM (©IEEE. Reprinted, with permission, from [19])
Fig. 12.14 Topology changing, the network operation stages, and the processing stages in
reconfiguration (©ACM. Reprinted, with permission, from [15])
Fig. 12.16 (a) In X-direction, the 16-bit topologies of each X-Z plane are collected and
synchronized. (b) In Y-direction, the entire 64-bit topology is collected and synchronized (©ACM.
Reprinted, with permission, from [15])
Based on the collected and synchronized topology, the routing mode for data delivery toward each destination can be determined. The details of the routing mode determination will be introduced in Sect. 12.4.3. After reconfiguration stage (iii), the network returns to the normal stage, and data transfer for the application layer restarts.
Fig. 12.17 Fail delivery cases that block packets. (a) Source-throttled case. (b) Destination-
throttled case. (c) Path-throttled case. (d) Long-term Head of Line (HoL) blocking caused by the
previous three cases (©ACM. Reprinted, with permission, from [15])
The path-throttled case in Fig. 12.17c occurs where the channels on the routing path are occupied by other blocked packets. If the source router is fully throttled, the packetized message is blocked in the network interface. If either case in Fig. 12.17b or c occurs, the injected packets are blocked somewhere on the routing path and form a congestion tree. Other packets are then blocked as in Fig. 12.17d.

To eliminate the source-throttled case in Fig. 12.17a and the destination-throttled case in Fig. 12.17b, the throttling information of all routers is required at each node. The Head-of-Line (HoL) blocking in Fig. 12.17d traditionally results from congestion in the switch, and its probability of occurrence can be reduced by applying virtual channel (VC) flow control or output-buffered router architectures. Due to the source-throttled, destination-throttled, and path-throttled cases, however, a new type of long-term HoL blocking may occur, which has to be eliminated by preventing the occurrence of the aforementioned three cases. The path-throttled case in Fig. 12.17c depends on the routing path: before injecting a packet, we must guarantee that there is at least one non-fully-throttled path toward the destination router, and the packet is then routed on that guaranteed path.
Fig. 12.18 Block diagram of transport layer in the tile of thermal-aware 3D NoC (©ACM.
Reprinted, with permission, from [15])
the application layer and the requests of the network layer. The TT stores the throttling information of the entire network, which represents the network topology. The results of path selection, which we define as the routing mode, are saved in the RMM.
Figure 12.19 shows the flow chart for handling application-layer requests. When the application layer requests to deliver data from the current node to a destination node, the TLC first checks whether the source router is active. If the source router is throttled, all data transfers from it are undeliverable during that period. Then, the TLC checks whether the destination router is active. If the destination router is throttled, the current data transfer is undeliverable. Because the downward path is retained as the guaranteed routable path, the transfer is unconditionally routable as long as both the source router and the destination router are active. However, if a general RTM is adopted, the path-throttled case may happen: there may be no guaranteed routable path, and the check cannot be skipped. The TLC refers to the RMM to set the routing mode flag in the packet header. After packetization, the packet is injected into the Tx packet queue. The routers in the network layer follow the routing mode to prevent the path-throttled case.
Figure 12.20 shows an example of the routing mode selection of the TLAR scheme. The routing contains the downward routing [16], which is a combination of vertical routing and lateral routing. As shown in Fig. 12.20a–c, only the lateral-first path and the downward-first path are allowed at the source router. The up-then-lateral turns (i.e., Up-North, Up-East, Up-South, and Up-West) are prohibited to prevent deadlocks. The downward-first path is guaranteed routable because of the characteristics of the NSI-mesh topology, whereas the lateral-first path is guaranteed routable if and only if all the routers on it are active. Therefore, in the reconfiguration stage of Fig. 12.14, the lateral-first path has to be checked. In this section, we investigate three downward-lateral routing algorithms and the corresponding checking methods for determining the routability of the lateral-first path.
The baseline algorithm in TLAR is the combination of downward routing and deterministic routing. The downward routing is used for moving packets up and down in the vertical direction. The lateral deterministic routing (LDR) is used for routing packets within the source layer. The path diversity (i.e., the number of routable paths) is one, no matter whether the source layer or the bottom layer is chosen. The choice of deterministic routing algorithm affects the computational complexity of checking whether the source layer is laterally routable, as shown in Fig. 12.19. For simplicity, XY routing, a dimension-ordered routing (DOR), can be adopted as the deterministic routing. The checking is done in an incremental style for each destination during the reconfiguration stage of the RTM. The TLC checks whether there is any fully-throttled router on the path toward every destination, based on the table that stores the throttling information. The checking of all XY locations in the source layer can be done in O(N²) by using the incremental checking flow indicated by the sequence numbers in Fig. 12.21b.
Fig. 12.21 (a) Dependency for incremental routability checking of DLDR. (b) An example of
checking in DLDR. (c) Operation flow for setting the routing mode in DLDR (©ACM. Reprinted,
with permission, from [15])
The dependency of the routability checking is shown in Fig. 12.21a. Because the lateral routing algorithm is XY routing, node e is routable if node d is routable, and node b is routable if node a is routable. The operation flow of DLDR is shown in Fig. 12.21c. If a packet is LDR routable, it first traverses the lateral path in the source layer and then goes up or down to its destination router. Otherwise, the downward path is chosen, and the packet performs its lateral routing in the bottom layer.
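The sketch below implements this O(N²) incremental check for one source on an N × N layer, reusing each neighbor's result along the X-then-Y order of XY routing; the array-based representation and the function name are illustrative.

```python
def ldr_routable(active, sx, sy, N):
    """active[x][y]: router not fully throttled. Returns a routable[x][y]
    map of destinations reachable by XY routing from source (sx, sy)."""
    routable = [[False] * N for _ in range(N)]
    routable[sx][sy] = active[sx][sy]
    # sweep along X first: (x, sy) depends only on its X-neighbor's result
    for x in range(sx + 1, N):
        routable[x][sy] = routable[x - 1][sy] and active[x][sy]
    for x in range(sx - 1, -1, -1):
        routable[x][sy] = routable[x + 1][sy] and active[x][sy]
    # then along Y: (x, y) depends on (x, y -/+ 1) toward the source row
    for x in range(N):
        for y in range(sy + 1, N):
            routable[x][y] = routable[x][y - 1] and active[x][y]
        for y in range(sy - 1, -1, -1):
            routable[x][y] = routable[x][y + 1] and active[x][y]
    return routable
```

Each destination is visited once and only consults one previously computed neighbor, which is exactly the incremental dependency of Fig. 12.21a.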
Fig. 12.22 (a) Dependency for incremental routability checking of DLAR. (b) An example of
checking in DLAR. (c) Operation flow for setting the routing mode of a packet in DLAR (©ACM.
Reprinted, with permission, from [15])
Fig. 12.23 (a) An example of checking in DLADR. (b) Operation flow for setting the routing
mode of a packet in DLADR (©ACM. Reprinted, with permission, from [15])
The checking dependency of DLAR is shown in Fig. 12.22a, where node e is routable if nodes c and d are both routable; similarly, node b is routable if node a is routable. As shown in Fig. 12.22c, if the lateral path is guaranteed routable, the packet is first laterally routed in the source layer; otherwise, the lateral routing is completed in the bottom layer.
The advantages of DLAR are the increased path diversity and the capability of load balancing. However, if there is a throttled router on one of the lateral-first paths, DLAR chooses the downward-first path to prevent the occurrence of the path-throttled case. Therefore, the LAR-routable destinations are fewer than the LDR-routable destinations. To increase the number of lateral-first-path routable destinations, we introduce the downward-lateral adaptive-deterministic routing (DLADR). The idea of DLADR is to combine DLDR and DLAR, as shown in Fig. 12.23b. The destinations are categorized into three types: (i) the guaranteed adaptive routable (LAR routable), (ii) the guaranteed XY routable (LDR routable), and (iii) the non-guaranteed lateral routable, which is downward routable. If a destination is guaranteed adaptive routable, it is guaranteed XY routable; if a destination is guaranteed XY routable, it is downward routable. Therefore, the downward-routable destination set is a superset of the LDR-routable set, and the LDR-routable set is a superset of the LAR-routable set.
Fig. 12.24 (a) Statistical traffic load distribution (STLD) and (b) distribution of the routing mode
decision with one 1 × 1 × 3 pillar fully throttled in the center under uniform traffic pattern (©ACM.
Reprinted, with permission, from [15])
Figure 12.23b shows the operation flow for setting the routing mode of the packets in DLADR. Because the lateral-first adaptive routing is able to balance the traffic load, it should be used as much as possible. The lateral-first deterministic routing results in less traffic congestion in the bottom layer than downward-first routing, so the priority of lateral-first deterministic routing is higher than that of downward-first routing. In DLADR, we use a simpler adaptive routing instead of the throttling- and traffic-aware routing: the west-first turn model is adopted for the routing function, and the selection of the output channel depends on which channel has more unallocated flit buffers. By combining the essences of DLDR and DLAR, the vertical traffic load is better balanced. Therefore, DLADR is able to achieve higher throughput and lower latency with a simpler architecture. Figure 12.23a shows an example of the checking order in DLADR.
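The mode-selection flow of Fig. 12.23b, together with the buffer-based channel selection, can be summarized as follows. This is a sketch: `lar_ok` and `ldr_ok` stand for the two routability tables produced by the checks described above, and the dictionary-based channel representation is an assumption of ours.

```python
def dladr_mode(dest, lar_ok, ldr_ok):
    """DLADR routing-mode decision (sketch of Fig. 12.23b): prefer the
    lateral-first adaptive mode, then lateral-first XY, and fall back
    to downward-first routing for non-guaranteed destinations."""
    if lar_ok[dest]:
        return "LATERAL_ADAPTIVE"   # west-first adaptive in the source layer
    if ldr_ok[dest]:
        return "LATERAL_XY"         # deterministic XY in the source layer
    return "DOWNWARD_FIRST"         # lateral routing done in the bottom layer

def pick_output(candidates):
    # adaptive selection: among the output channels admissible under the
    # west-first turn model, take the one with the most free flit buffers
    return max(candidates, key=lambda ch: ch["free_flit_buffers"])
```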
Fig. 12.25 (a) Statistical traffic load distribution (STLD) and (b) distribution of the routing mode
decision with one 2 × 2 × 3 pillar fully throttled on the diagonal under uniform traffic pattern
(©ACM. Reprinted, with permission, from [15])
Fig. 12.26 Average latency vs injection rate with (a) one 1 × 1 × 3 pillar throttled and (b) two
2 × 2 × 3 pillars throttled (©ACM. Reprinted, with permission, from [15])
The experiments in Figs. 12.24 and 12.25 use the same injection rate, which makes the average latency of TLAR-DLDR twice the zero-load latency.
Figure 12.24a shows the STLD of the baseline downward routing and the other three TLAR algorithms. Though there is only one throttled pillar in the network, many packets have to be routed downward through the bottom layer when the downward routing is applied. The congestion degree of DLDR in the bottom layer is slightly reduced but is still the second largest. DLAR can balance the network traffic, because the congestion in the bottom layer is relaxed by applying the traffic- and throttling-aware adaptive routing algorithm. For DLADR, more packets are routed laterally in the source layer because of the larger routing path diversity.
Figure 12.24b shows the distribution of the routing mode decision. Because of the larger lateral path diversity provided by the TLAR, the total ratio of downward routing decisions is reduced.
12.5 Conclusions
In this chapter, the traffic- and thermal-aware routing problems in 3D NoC systems were categorized into off-line and on-line types. For the off-line traffic- and thermal-aware routing, we introduced a novel design scheme to find an optimal design space of the system without RTM. For a thermal-aware 3D NoC system with RTM, the topology may become the NSI-mesh, on which traditional routing algorithms cannot sustain successful data delivery. To address these shortcomings, several on-line traffic- and thermal-aware routing algorithms were investigated in this chapter. With these routing algorithms, the benefits of a 3D NoC system can be preserved without enhanced cooling devices.
Abstract Scaling the transistor feature size along with the increase of chip
operation frequency leads to the growth of circuit complexity, which makes
the design of electrical interconnects increasingly difficult in large chips. Optics
provides low power dissipation which remains independent of capacity and distance,
as well as wavelength parallelism, ultra-high throughput, and minimal access
latencies. Additionally, wavelength routing, bit rate transparency, high-capacity, low
propagation loss, and low power dissipation of silicon photonics are attractive for
realizing optical Networks-on-Chip (ONoC) in Chip Multi-Processors (CMPs). In
this chapter, we propose a new architecture for nanophotonic NoC, named 2D-
HERT, which consists of optical data and control planes. The proposed data plane
is built upon a new topology and all-optical switches that passively route optical
data streams based on their wavelengths. Utilizing the wavelength routing method, the proposed deterministic routing algorithm, and the Wavelength Division Multiplexing (WDM) technique, the proposed data plane eliminates the need for optical resource reservation at the intermediate nodes. For resolving end-point contention, we
propose an all-optical request-grant arbitration architecture which reduces optical
losses compared to the alternative arbitration schemes. By performing a series of
simulations, we study the efficiency of the proposed architecture, its power and
energy consumption, and the data transmission delay.
To deal with this challenge, significant research activity has recently focused
on intrachip global communication using packet-switched micro networks, referred
to as NoC [1]. Since performance-per-watt is expected to remain the fundamental
design metric for high-performance multi-processor systems [2], on-chip inter-
connection networks will have to satisfy communication bandwidth and latency
requirements in a power efficient fashion.
Scaling the transistor feature size along with the increase of chip operation
frequency leads to growth of the circuit complexity, which makes the design of
electrical interconnects (EI) increasingly difficult in large chips. This problem, which has been predicted for about three decades [3], stems from many limitations associated with metallic interconnects. In addition to the quantified limitations of metallic
interconnect, such as latency, throughput, bandwidth and power consumption, there
are a variety of other problems that cannot be quantified easily, such as inter-
line crosstalk, wave reflection phenomena, Electromagnetic Interference (EMI) and
the difficulty of voltage isolation and timing accuracy [4]. Although using low
resistance metals such as copper and low dielectric constant materials decreases
the interconnect delay, bandwidths will be insufficient for long interconnects for
future operating frequencies, and power budget cannot be confined to the package
power constraints. While NoC, as a new architectural trend, can improve bandwidth
of electrical interconnections, it is unclear how electrical NoCs will continue to
satisfy future bandwidth and latency requirements within the package power budget
[5]. Other physical approaches that can improve the metallic interconnections, such
as cooling the chips and/or circuits (to decrease interconnect resistance), or using
superconducting lines [6] cannot address qualitative problems like voltage isolation,
timing accuracy and EMI.
Optics is a very different physical approach that can address most of the problems
associated with electrical interconnects such as bandwidth, latency, crosstalk,
voltage isolation, and wave reflection [4]. Additionally, bit rate transparency [7]
of optical switching elements and low propagation loss of optical waveguides lead
to low power dissipation of silicon photonics. The importance of power dissipation in NoCs, along with the power-reduction capability of on-chip optical interconnects, makes the optical network-on-chip (ONoC) a novel technology solution that can provide an on-chip interconnection architecture with high transmission capacity, low power consumption, and low latency. While traditional NoCs incur unaffordable power dissipation in high-performance MPSoCs, the unique advantages of ONoC offer considerable power efficiency and preserve performance-per-watt scaling as the most critical design metric.
Several on-chip interconnect architectures have been proposed that leverage
CMOS-compatible photonics for future multicore microprocessors. Most of the
proposed optical interconnect architectures are bus-based. For example, the Cornell
hybrid electrical/optical interconnect architecture [8] comprises an optical ring that
assigns unique wavelengths per node in order to implement a multi-bus. Firefly [9],
as a hybrid electrical/optical network, proposes the implementation of reservation-
assisted single-write-multi-read buses. Moreover, the HP Corona crossbar architecture [10] in fact comprises several multiple-writer, single-reader buses routed in a snake pattern among the nodes.
The Columbia optical network [5] is one of the few that proposes on-chip
optical switches. The proposed architecture combines a high-speed photonic circuit-
switched network with an electronic packet-switched control network. The electrical
sub-network sets up the switches in advance of data transmission and tears down
the network thereafter. Hence, the network must transmit a large amount of data to
amortize the relatively high latency of the electrical setup/teardown network. The optical NoC proposed by Koohi et al. [11], referred to as CONoC, improves on the hybrid architecture proposed by Shacham et al. [5]. CONoC utilizes a wavelength routing method, instead of electrical methods, for optical packet ejection while retaining
the scalability of the network. Despite its simpler electrical routers and reduced
setup latency compared to previously proposed ONoCs, CONoC cannot eliminate
the role of electrical transactions. Cianchetti et al. [12] have presented Phastlane,
a hybrid electrical/optical routing network, for future multicore microprocessors.
Phastlane network is built upon a low latency optical crossbar for data transmission
under contention-less conditions. When contention exists, the router makes use of
electrical buffers and, if necessary, a high speed drop signaling network.
Almost all of the previously proposed hybrid architectures in [5, 8, 9, 11, 12]
and [13] suffer from high latency and power overheads for electrically resolving
optical contentions. Therefore, an all-optical NoC can overcome the limitations
of electrically-assisted ONoCs. Briere et al. [14] have developed an all-optical
contention-free NoC which routes optical signals according to their wavelengths.
However, the proposed contention-free structure is obtained at the cost of large arrays of fixed-wavelength light sources and fast switches for wavelength selection, which limit scalability and severely increase power consumption and area.
The selected topology of the photonic on-chip interconnect plays a prime role in the performance of an ONoC architecture, as well as in the routing and switching techniques that can be used. Although recent studies on the design of optical on-chip networks have addressed various network topologies, almost all of these optical architectures [5, 9, 11, 15–17] are built upon traditional topologies initially introduced for electrical NoCs, such as Mesh, Torus, Spidergon, Fat tree, Honeycomb, and crossbar. Moreover, previously developed optical architectures adopt the routing algorithms of the corresponding electrical topologies for optical data routing through the network. Due to the inherently different properties of light transmission through optical waveguides and photonic switching elements, novel topologies specifically developed for the optical infrastructure are essential to fully utilize the advantages of silicon photonic technology.
This chapter proposes a novel topology for ONoC architectures, which accounts for the physical properties of light transmission, and examines the advantages and
limitations of routing data streams through photonic switching elements. Some of
the most interesting characteristics of the proposed topology are: (i) regularity,
(ii) vertex symmetry, (iii) scalability to large scale networks, (iv) constant node
degree, and (v) simplicity. Moreover, this chapter proposes a general all-optical
routing algorithm which can be adapted to efficiently route optical data streams
through various topologies. The key advantages of the proposed routing algorithm
are minimalism and simplicity, which lead to small and simple optical routers.
Built upon our novel network topology, we propose a scalable all-optical NoC
as a global communication medium, which offers all-optical on-chip routing of data
streams. Passive routing is adopted in the proposed optical architecture, which is
performed by routing optical data streams based on their wavelengths. Utilizing
wavelength routing method, our proposed optical NoC eliminates the need for
electrical resource reservation and the corresponding latency and area overheads.
Taking advantage of Wavelength Division Multiplexing (WDM) technique, the
proposed architecture avoids packet congestion scenarios (discussed in [5, 11, 12]),
and guarantees contention-free operation of the network. While the number of
switching elements in the contention-free ONoC proposed in [14] is quadratically
proportional to the number of IP blocks, the optical on-chip architecture introduced
in this chapter reduces the required number of routing elements to the number of
processing cores. The latter property, along with the reduced number of wavelength
channels in the proposed ONoC leads to a scalable architecture for on-chip routing
of optical packets. Moreover, combining wavelength routing method with WDM
technique improves the functionality and total performance of the proposed optical
NoC, and enables the network to transmit multicast and unicast packets efficiently.
This section introduces the data plane of the proposed architecture, built upon an
all-optical switch architecture. Before exploring the architecture of the proposed
ONoC, we discuss prominent criteria of an efficient optical topology.
The main issues in the design of an optical network-on-chip include the topology of the network; its flexibility for extension with minimal links and with no, or slight, modification of the edge nodes; a short diameter to achieve low latency; and the simplicity of the photonic router design. Prior to presenting our novel topology, we discuss these influential metrics in more detail.
The node degree in an optical on-chip network plays a critical role in the simplicity of the photonic router design and its implementation cost. Although the implementation of a high-degree electronic crossbar is simple, a crossbar is not a scalable photonic topology [18], because the number of microring resonators required for photonic crossbars increases quadratically with the node degree. Specifically, it is quite
difficult to construct optical crossbars larger than 4 × 4 using the existing 2 × 2 photonic switching elements.
The planar layout of the proposed topology enables single-layer construction of the optical transmission network above the metal stack [5], thus reducing fabrication complexity, chip dimensions, and total cost. Moreover, a two-dimensional (2-D) optical network reduces the number of waveguide crossings, thus reducing the total waveguide-intersection crosstalk.
13.2.1.3 Scalability
13.2.1.4 Symmetry
Symmetrical and regular optical on-chip architecture facilitates the design of pho-
tonic routers and optical routing algorithm, and also reduces fabrication complexity
of the optical network.
Most recent optical NoC architectures have been implemented on top of Mesh or Torus topologies. Despite their efficiency for electrical NoCs, the node degree of five (including the connection to the local IP) complicates photonic switch design in these architectures. Implementing the Torus topology, Shacham et al. [5] solved the problem
by introducing extra injection and ejection switches. These extra switches increase design complexity and add to power and area overheads. The ONoC architecture proposed by Koohi et al. [11] is built upon the Spidergon topology. Despite its constant node degree of four, the network diameter of the Spidergon topology limits the efficiency of large-scale ONoCs.
To overcome the drawbacks of the traditional topologies in optical NoCs, this
chapter addresses a new topology for photonic on-chip networks. The proposed
topology, which is referred to as 2D-HERT, benefits from high degree of connec-
tivity along with small node degree. 2D-HERT is built upon clusters of processing
cores locally connected by optical rings. Different clusters in the proposed topology
are interconnected by global optical 2D rings. A sample 2D-HERT interconnecting
64 optical switches is shown in Fig. 13.1a, where all connections are made through
optical links. As depicted in this figure, although 2D-HERT can be viewed as a two-dimensional hierarchical expansion of the Ring topology, the various optical routers are interconnected at the same level of hierarchy, and each super-node represents a group of four optical routers.
Schematically, the proposed topology consists of k diagonals each interconnect-
ing m (even) local clusters of four processing cores. For brevity, we will refer
to the local clusters as super-nodes (SN). Hence, for a 2D-HERT architecture
interconnecting N processing cores, we have N = 4 × k × m. Based on this notation,
each IP (and its corresponding optical switch) is uniquely identified by a triplet (d, s, p), where 0 ≤ d < k, 0 ≤ s < m, and 0 ≤ p < 4; here d and s refer to the index of the
corresponding diagonal and super-node, respectively, and p represents the index
of the processing core within the super-node. Figure 13.1a shows pair (s,d) for
each super-node and triplet (s,d,p) for a sample IP. The main advantages of this
topology are:
• Constant node degree of four, similar to Spidergon, for arbitrary numbers of
processing cores;
• High degree of connectivity, similar to Mesh (Torus) topology, which leads to
small network diameter;
• Regularity;
• Scalability for large scale optical networks;
• Local optical data transmission between neighbor nodes to reduce power and
delay metrics for local traffic distribution.
2-D layout of the 64-node 2D-HERT is depicted in Fig. 13.1b.
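For concreteness, the (d, s, p) addressing can be sketched as below; the flat node numbering is an assumption for illustration, since the chapter does not define one:

```python
def node_count(k, m):
    """N = 4 * k * m cores: k diagonals, m (even) super-nodes per
    diagonal, four cores per super-node."""
    assert m % 2 == 0
    return 4 * k * m

def to_triplet(node_id, m):
    """Map an (assumed) flat id in [0, 4*k*m) to its (d, s, p) address."""
    d, rest = divmod(node_id, 4 * m)   # diagonal index
    s, p = divmod(rest, 4)             # super-node index, port in cluster
    return d, s, p

def to_flat(d, s, p, m):
    return (d * m + s) * 4 + p

# e.g., in a 64-node network (k = m = 4), id 14 maps to (0, 3, 2)
assert to_triplet(14, 4) == (0, 3, 2)
```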
Fig. 13.1 (a) 2D-HERT architecture, (b) 2D-HERT layout (SN super node)
To route an optical packet from the source node (d_s, s_s, p_s) to the destination node (d_d, s_d, p_d), circular links are traversed first to reach the destination diagonal.
Then, radial links are taken through this diagonal to reach the target super-node.
For minimal data routing, the proposed routing scheme may take advantage of
the wrap-around links, depending on the position of the target super-node. Finally,
within the destination cluster, intra-cluster links are utilized to reach the target node.
Figure 13.2 illustrates an example of the path taken by circular-first routing in a
144-node 2D-HERT.
It is straightforward to show that the CF routing scheme takes a minimal path through the 2D-HERT architecture. Considering the proposed algorithm, the distance from
346 S. Koohi and S. Hessabi
the source node $(d_s, s_s, p_s)$ to the destination node $(d_d, s_d, p_d)$, in terms of long hops (i.e., links interconnecting super-nodes), can be computed as follows:

$$D = \min\big(|d_s - d_d|,\ k - |d_s - d_d|\big) + \min\big(|s_s - s_d|,\ m - |s_s - s_d|\big) \tag{13.1}$$
where the first term represents the number of circular links traversed by the optical data stream, and the second term represents the number of radial links taken by the CF routing along the destination diagonal to reach the target super-node. Based on this equation, in an N-node 2D-HERT implementing the CF routing scheme, the network diameter, defined as the maximum shortest-path length between any pair of nodes, equals k/2 + m/2 in terms of the long hops interconnecting super-nodes.
Fig. 13.3 (a) Optical Add/Drop Element (OAD), (b) modified OAD element
An OAD element adds (drops) a specific wavelength (λ2) to (from) the optical stream without any electronic processing. To reduce the coupling loss, ring resonators can be used to inject a single wavelength from one waveguide into another. If the ring is placed between two waveguides and the coupling between the ring and the waveguides is properly chosen, the wavelength resonating within the ring will be transferred from one waveguide to the other [10]. Figure 13.3b depicts the modified OAD element realized in this way.
Utilizing OAD filters, we propose a passive optical switch, named WaROS (Wavelength-Routed Optical Switch). The proposed switch eliminates optical resource reservation at the intermediate nodes, which is required in most of the previously introduced ONoCs [5, 11, 15, 16].
2D-HERT associates one wavelength with each router in the network, and the optical streams targeted to a specific node are modulated on its dedicated wavelength.
As a key idea, the modulation wavelength of the optical packet is utilized as the
destination address to uniquely determine the routing path from the source to the
destination node. In other words, the address of the target is not contained in the
packet; rather, it is embedded in the wavelength of the optical signal.
Although some optical bus-based architectures, such as Corona [10], associate one wavelength with each optical router, they do not route optical data streams
based on the modulated wavelength. In a bus-based architecture, various nodes are
simply connected through multiple optical buses and modulation wavelength of the
optical stream is solely used to eject the data at the destination node. Since on-chip
data routing is not performed in the bus-based architectures, these architectures
suffer from high network diameter, which increases data transmission power and
delay. However, 2D-HERT architecture is a network of optical switches, in which a
deterministic routing algorithm determines an efficient path for optical packets from
the source to the destination node. This path may go through various intermediate
nodes. For this purpose, 2D-HERT utilizes the modulation wavelength of the packet,
as the destination address, to make proper turns at the intermediate nodes.
The minimum number of wavelength channels that guarantees contention-free
routing of optical packets considerably impacts the performance of the optical
NoC. Hence, WaROS, as a basic building block of the proposed ONoC, should be designed to require the minimum number of wavelength channels.
On the other hand, optical data streams passing through the circular link connecting the ith diagonal (D_i) to the (i+1)th diagonal (D_{i+1}) consist of optical packets targeted to any super-node located on the diagonals at distance d2 from D_i, where d2 ≤ k/2. Hence, the MDM for the circular links can be calculated accordingly.
Optical interference may occur when two data streams are targeted to different destination nodes, e.g., N1 and N2, that are associated with the same wavelength. Assuming only one data stream can be targeted to a specific node at a time, it is easy to show that optical interference is possible if N1 and N2 are located in different wavelength groups. Based on this
fact, to prevent interference scenarios, the proposed ONoC architecture utilizes
the path direction to distinguish optical data streams targeted to the different
wavelength groups. Specifically, the clockwise direction is taken within a super-node for routing data streams to destination nodes in the same wavelength group, while the counter-clockwise direction is chosen for routing data packets targeted to the other wavelength group. Therefore, in the case of different wavelength groups
for source and destination nodes, path direction changes from counter-clockwise
to clockwise at the boundary of wavelength groups, where data streams enter the
target group. As shown later, the proposed optical switch is developed to passively
choose proper path direction for each optical data passing based on its modulated
wavelength. While this property may lead to non-minimal paths inside the clusters,
the proposed routing algorithm guarantees minimal routing through the long hops,
interconnecting super nodes. Figure 13.5b illustrates the non-interfering routing
paths inside source, intermediate, and target super-nodes for two pairs of source
and destination nodes depicted in Fig. 13.5a.
Each super-node is connected to the adjacent super-nodes on the same diagonal through radial links, and to the neighbor cluster on the same circle through circular links. In this regard, we partition the optical switches in the 2D-HERT optical NoC into two groups: radial and circular switches. According to the CF routing scheme, the routing role differs for the different types of optical switches, which leads to different architectures for radial and circular switches.
13.2.4.2.1 Injection
One of the key advantages of the 2D-HERT ONoC is its capability of optical
data multicasting. Due to the low-loss and ultra-high bandwidth of the optical
waveguides, implementing multicast communication as multiple unicast transmissions becomes an efficient scheme in optical networks. In this regard, each
router can inject simultaneous packets to n destination nodes, where 1 ≤ n ≤ N − 1.
However, since each wavelength channel is devoted to two different nodes in the
network, to enable simultaneous data transmission to N nodes in the worst case, WaROS uses two injection ports, i.e., I1 and I2. In this manner, each injection port in an optical switch is responsible for transmitting optical messages to N/2 destination nodes. Assuming Ni as the source node, the first (I1) and second (I2)
injection ports of the optical switch are utilized for optical data transmission to
the same and the other wavelength groups, respectively. It is worth noting that the
appropriate injection port for each optical packet is chosen by the processing core
at the source node.
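A sketch of the injection-port choice; the rule that the wavelength index equals the node id modulo N/2, with the two groups formed by the two halves of the id space, is an illustrative assumption rather than the chapter's exact assignment:

```python
N = 64                                   # network size (example value)

def wavelength_of(node_id):
    # each of the N/2 wavelength channels is shared by exactly two
    # nodes, one per wavelength group (illustrative assignment)
    return node_id % (N // 2)

def group_of(node_id):
    return node_id // (N // 2)           # wavelength group 0 or 1

def injection_port(src_id, dst_id):
    """I1 serves destinations in the source's own wavelength group,
    I2 serves the other group; the processing core picks the port."""
    return "I1" if group_of(src_id) == group_of(dst_id) else "I2"
```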
13.2.4.2.2 Ejection
13.2.4.2.3 Routing
$$D_{d1} = \left\{ R(d_d, s_d, p_d) \,\middle|\, d_d = d_s,\ \left(s_s - \tfrac{m}{2}\right) \bmod m < s_d < s_s,\ 0 \le p_d < 4 \right\} \tag{13.4d}$$

$$D_{d2} = \left\{ R(d_d, s_d, p_d) \,\middle|\, d_d = d_s,\ s_s < s_d \le \left(s_s + \tfrac{m}{2}\right) \bmod m,\ 0 \le p_d < 4 \right\} \tag{13.4e}$$

$$D_{2,\alpha} = \left\{ R(d_d, s_d, p_d) \,\middle|\, \tfrac{k}{2} \le d_d \le d_s + \tfrac{k}{2},\ 0 \le s_d < m,\ 0 \le p_d < 4 \right\} \tag{13.4f}$$

$$D_{2,\beta} = \left\{ R(d_d, s_d, p_d) \,\middle|\, d_s + \tfrac{k}{2} < d_d < k,\ 0 \le s_d < m,\ 0 \le p_d < 4 \right\} \tag{13.4g}$$
where R(d,s,p) represents the optical router connected to the pth processing core
inside the sth super-node on the dth diagonal.
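The cyclic super-node intervals in Eqs. 13.4d and 13.4e wrap around index 0, so plain comparisons are not enough; a minimal sketch with an explicit modular-interval helper (function names are ours) might look like this:

```python
def between_cyclic(x, lo, hi, m):
    """True if x lies strictly between lo and hi when walking forward
    around an m-position ring (handles the mod-m wraparound)."""
    return 0 < (x - lo) % m < (hi - lo) % m

def in_Dd1(dst, src, m):
    # Eq. 13.4d: same diagonal, (s_s - m/2) mod m < s_d < s_s (cyclic)
    (dd, sd, _), (ds, ss, _) = dst, src
    return dd == ds and between_cyclic(sd, (ss - m // 2) % m, ss, m)

def in_Dd2(dst, src, m):
    # Eq. 13.4e: same diagonal, s_s < s_d <= (s_s + m/2) mod m (cyclic)
    (dd, sd, _), (ds, ss, _) = dst, src
    return dd == ds and 0 < (sd - ss) % m <= m // 2
```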
For appropriately demultiplexing optical data streams targeted to different
destination groups, defined above, WaROS utilizes optically controlled microring
Fig. 13.7 WaROS architecture: (a) p = 1, (b) p = 3, (c) p = 0, (d) p = 2
Similar to the architecture proposed by Briere et al. [14], the 2D-HERT ONoC is built upon a passive optical architecture that requires neither electrical nor optical reservation of optical resources at the intermediate switches (only reservation of the ejection channel at the destination node is required).
The CF routing scheme organizes the processing cores into two groups, and optical data streams targeted to different groups are distinguished by the path direction, which is either clockwise or counter-clockwise. Hence, the circular-first routing scheme can be adopted in any topology where clockwise and counter-clockwise directions can be defined, such as Torus, Spidergon, and Ring.
The proposed architecture can benefit from WDM technique for simultaneous
transmission of unicast and multicast optical packets, assigning distinct wavelengths to the unicast and multicast communication patterns of every node.
For transmitting a multicast packet, multiple unicast optical packets are generated
from the initial multicast electrical packet. In this regard, destination nodes of the
electrical multicast packet determine the modulation wavelengths of the generated
unicast optical packets.
Wavelength routing of optical packets in ONoCs [10, 11] can possibly avoid
network contention. However, to resolve end-point contention, different control
plane architectures [5, 10, 11, 16] have been proposed so far to prevent two or
more source nodes from transmitting simultaneous data to the same destination.
In the electrical control planes [5, 11, 16], electrical control packets are transmitted
through an electrical sub-network to the destination node to reserve the required
optical resources. While electrical control planes can be easily implemented [5,
11, 16], they undergo considerable power and delay overheads for electrical packet
transmission.
As an optical solution, Koohi et al. [23] have proposed a request-retransmission scheme to resolve contention at the destination node. For this purpose, in the case of a busy destination, the source node waits for a time interval and then reattempts to transmit the optical data. Despite its simplicity, the proposed retransmission scenario increases the total number of control packets, which leads to considerable power overhead. Moreover, since the time interval between consecutive requests
is determined at the source node, optical data transmission may not be initiated
immediately after the previous transmitter releases the destination node. Therefore,
data transmission latency increases and optical resource utilization decreases.
Corona architecture [10] proposes a distributed, all-optical, token-based arbitra-
tion scheme to resolve end-point contention. In this approach, a token for each node,
which represents the right to modulate on each node’s wavelength, is passed around
all the nodes on a dedicated arbitration waveguide.
In this chapter, we propose an all-optical request-grant arbitration architecture,
referred to as ORG, which reduces optical losses compared to Corona's token-based architecture. As a key point, the proposed control network uses one
wavelength per sender to manage end-point contention. Hence, wavelength routing
of the data packet through Multiple-Write Single-Read (MWSR) waveguides in the
data network along with the utilization of Single-Write Multiple-Read (SWMR)
waveguides in the proposed control network offers an efficient all-optical approach
for ONoCs.
Before initiating optical data transmission at the source node, an optical request packet is routed to the target node to check the status of the corresponding ejection channels. If an ejection channel is free, an optical grant packet is sent back to the source node to initiate its optical data transmission. In the case of a busy destination, however, a request flag is set at the destination node, and a grant packet is sent back to the source node only after the receiver channels are released. To guarantee fairness, multiple requests are granted in a round-robin fashion.
Once the existence of a free ejection channel at the destination node has been
verified, an optical message can be transmitted through the optical waveguides
and switches without buffering. The modulated wavelengths of the data streams
optically control the switching state of the resonators on the path.
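The request-grant handshake at a destination can be modeled with a few lines of state; this is a behavioral sketch, not the chapter's hardware, and FIFO ordering stands in for the round-robin grant policy:

```python
from collections import deque

class EjectionArbiter:
    """Per-destination ejection-channel arbiter (sketch of the ORG
    handshake; class and method names are illustrative)."""
    def __init__(self):
        self.busy = False
        self.waiting = deque()          # set request flags, oldest first

    def request(self, src):
        """Handle an optical request packet from `src`: return the
        source to grant now, or None if the grant is deferred."""
        if not self.busy:
            self.busy = True
            return src                  # grant packet goes back at once
        self.waiting.append(src)        # flag the request at the receiver
        return None

    def release(self):
        """Transmitter finished: grant the next waiting source, if any
        (FIFO here approximates the round-robin policy)."""
        if self.waiting:
            return self.waiting.popleft()
        self.busy = False
        return None
```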
13.3.2 Topology
Processing cores in each super node are electrically connected to a control unit
monitoring the status of the ejection channels in the cluster. Moreover, optical control packets, i.e., request and grant packets, are transmitted and received by the control unit in each super-node. Considering the 2D-HERT optical NoC, each control unit (CU) is uniquely identified by a pair of indices (d, s), 0 ≤ d < k, 0 ≤ s < m, and is connected to the four optical switches indexed with (d, s, p), 0 ≤ p < 4. Hence, in an N-node 2D-HERT, the number of control units is
limited to N/4.
As a simple version of the proposed control architecture, various control units are
interconnected through an optical SWMR bus [9]. As a key point, the proposed con-
trol architecture associates one wavelength channel to each control unit. Therefore,
for injecting control packets to the network, optical stream initiated by a specific
control unit is modulated on its dedicated wavelength and is transmitted through the
control bus. Hence, utilizing the WDM technique, concurrent control packets can be
targeted to the same control unit.
Considering the SWMR architecture for the optical bus, data transmitted through the control bus is detected and received by all control units, at each of which it is converted to electrical signals and examined to determine whether the control packet is targeted to the corresponding control unit. In this regard, taking advantage of
a single waveguide for control packet transmission leads to a simple and low
cost implementation. However, this simplicity comes at the price of increased
transmission power for control packets, since packets should be properly received
by all control units. As another alternative, we may dedicate an optical waveguide
to each control unit, which means that control packets targeted to a specific control
unit are transmitted through its associated waveguide. In this case, the number of
photonic detectors and receiver circuits located on each control waveguide reduces
to one, which leads to minimum optical power at the price of high design complexity
and implementation cost.
From the above discussion, there is a trade-off between the number of control
waveguides and the optical power dissipated for control packet transmission. As
a power-efficient low-cost implementation approach, ORG architecture partitions
N/4 control units to multiple disjoint sets, while control packets targeted to each set
are transmitted through a dedicated optical control waveguide. In this regard, optical
transmission power reduces compared to the single-waveguide approach, at the price
of higher, yet reasonable, implementation cost. As a case study, we consider one
optical control waveguide for all control units located on the same row in a planar
layout. Each control unit has a splitter that transfers a fraction of the light from the
corresponding control waveguide to a short dead-end waveguide that is populated
with detectors. Therefore, in an N/4 -node control architecture arranged in a K × L
planar layout, control units are interconnected by K different control waveguides.
To reduce network diameter, defined as the maximum shortest path length between
any pairs of nodes, each control waveguide is replaced with two unidirectional
waveguides to route the optical control packets through the minimal path. As an
example, Fig. 13.8 illustrates a 16-node control network interconnecting different
CUs in a 64-node data network. In each CU, the passive microring resonators,
shown in blue, are tuned to resonate at the associated wavelength channel of the
corresponding CU. On the other hand, the active microring resonators, shown in
green, modulate electrical control packets on the optical light stream, and only
one of them is active at a time. As depicted in this figure, the proposed control
architecture is built upon eight SWMR buses routed in a snake pattern among
the nodes.
Finally, since the data transmission network does not share optical waveguides
with the control buses, the wavelength channels devoted for data packet modulation
can be reused in the proposed control architecture. Hence, ORG architecture does
not increase the minimum number of wavelength channels required for all-optical
operation of the network.
Unlike the data transmission approach in the data plane, the modulation wavelength
of an optical control packet in the control plane is set to the associated wavelength
channel of the transmitter. Hence, the address of the target should be embedded in
the control packet to enable control packet processing at the respective destination
node. In the case of multicast communication, only one request packet is generated
for all target CUs located on the same row, and the packet is transmitted through
the corresponding optical waveguide. However, before the initiation of multicast
data transmission, individual grant packets should be received from the destination
control units. In other words, to ensure in-order data delivery, multicast data trans-
mission is delayed until the existence of a free ejection channel is acknowledged by
all respective multicast destinations.
The control packet is formatted as follows: a one-bit identifier, T, specifies the type of the control packet (i.e., request or grant). The n field (1 ≤ n ≤ N in an N-node network) contains the number of destinations, which equals one for unicast data transmission. The Add field is an array of destination addresses; its ith entry contains the address of the ith destination control unit (Add_CUi) and the two-bit address of the target IP inside the corresponding local cluster (Add_IPi). For a grant packet (T = 1), a one-bit flag, S, specifies the status of the respective ejection channel at the target node. Figure 13.9 depicts the control packet in the case of unicast data transmission.
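The packet layout of Fig. 13.9 maps naturally onto a small record type; the field names follow the text, while the container itself is an illustrative assumption:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ControlPacket:
    """ORG control packet (sketch of Fig. 13.9). Field widths follow the
    text: T is one bit, each Add entry pairs a CU address with a 2-bit
    in-cluster IP index, and S is only meaningful for grants (T = 1)."""
    T: int                      # 0 = request, 1 = grant
    n: int                      # number of destinations, 1 <= n <= N
    add: List[Tuple[int, int]] = field(default_factory=list)  # (Add_CU_i, Add_IP_i)
    S: Optional[int] = None     # ejection-channel status, grants only

# a unicast request targeting IP 2 of control unit 5 (made-up values)
req = ControlPacket(T=0, n=1, add=[(5, 2)])
```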
Figure 13.10 shows the data transmission scenario in the 2D-HERT ONoC. The key advantages of the ORG are summarized as follows.
All-optical control and data transmission phases, proposed in this chapter, consider-
ably reduce power and delay metrics compared to previously proposed electrically-
assisted ONoCs [5, 8, 9, 11–13] which suffer from high latency and power overheads
for electrical reservation of optical resources.
The optical NoC proposed by Gu et al. [15] takes advantage of optical control packets for path reservation. However, the control units in the FONoC architecture [15] use electrical signals to configure the switching fabric according to the routing requirement of each packet. Therefore, the control interfaces utilize optical-to-electrical (OE) and electrical-to-optical (EO) converters to convert optical control packets into electrical signals and vice versa. These opto-electrical conversions take place at each intermediate router on the path of the control packets, which leads to a considerable power increase.
In addition to all-optical architecture for control packet transmission through
TOC, the proposed control structure eliminates optical resource reservation at
intermediate routers. These key advantages improve power and delay metrics of
the 2D-HERT ONoC architecture compared to the previously proposed ONoCs.
Reservation-assisted ONoCs [5, 8, 9, 11, 12, 15] utilize path tear-down packets to
free up the path resources to be used by other optical messages. However, since our
proposed architecture eliminates resource reservation at the intermediate switches,
the path tear-down phase is omitted from the control phase.
13.3.4.3 Scalability
The request-grant control architecture reduces the number of control units to N/4 in
an N-node 2D-HERT, while it does not impact the minimum number of wavelengths
required in the proposed architecture. Moreover, the width of the optical control
bus is flexible and can be adjusted in the range of [1, N/4]. These architectural
advantages improve scalability of the ONoC.
The optical delays for the inter-super-node and intra-super-node links are calculated as 23.1 ps and 7.7 ps, respectively. Moreover, since for each switch the local link to the IP block is shorter than the three other links, we assume a length of 0.2 mm for the local links, which leads to an optical delay of 3 ps between each switch and its associated processing core.
According to the ORG architecture of Fig. 13.8, optical links between adjacent control units on a control waveguide have an approximate length of 2 mm, which leads to an optical delay of 30.8 ps between adjacent CUs. Finally, a 5 GHz clock speed [5] is assumed for data modulation, transmission, and demodulation in the network interface.
In this case study, we also assume an off-chip optical light source that feeds an on-chip filter bank of N/2 band-pass optical filters. The outputs of the filter bank (N/2 wavelength channels) are routed to all optical transmitters across the chip to provide the required wavelength channels.
$$\alpha = \frac{\text{Message Duration}}{\text{Message Duration} + \mu} \tag{13.5}$$
Finally, the traffic pattern in the network highly depends on the distribution of the packet destinations. To investigate the efficiency of the proposed ONoC, we perform various simulations under different synthetic workloads: uniform, hotspot, local, and first matrix transpose (FMT) [26]. In the case of local traffic, the neighbor nodes of each processing core are assumed to be located within the same local cluster (super-node). Although 2D-HERT is capable of data multicasting, in this case study we only discuss unicast data transmission.
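Sketches of the destination samplers for these workloads are shown below; the hotspot pattern is omitted because its parameters are not given here, and the FMT mapping assumes a square node arrangement:

```python
import random

def uniform_dest(src, n):
    # uniform: every node except the source is equally likely
    d = random.randrange(n - 1)
    return d if d < src else d + 1

def local_dest(src):
    # local: the destination lies within the source's own four-node
    # super-node, per the neighbor definition above
    base = 4 * (src // 4)
    return random.choice([base + p for p in range(4) if base + p != src])

def fmt_dest(src, n):
    # first matrix transpose (FMT): mirror the index across the
    # diagonal of an assumed sqrt(n) x sqrt(n) arrangement
    side = int(n ** 0.5)
    r, c = divmod(src, side)
    return c * side + r
```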
Assuming a per-port injection bandwidth of 640 Gbps, each optical data stream should be modulated on 64 wavelengths at a rate of 10 Gbps [27]. In a 64-node 2D-HERT architecture, the maximum degree of multiplexing is computed as N/2 = 32.
Hence, assuming 128 available wavelength channels, each optical data packet
Since the impact of traffic pattern is analyzed in Sect. 13.6, in this section, we
assume a uniform distribution for destinations.
Table 13.2 lists the main contributors to the optical latency for data and control packet transmission at the projected 22 nm CMOS process technology and a 5 GHz clock frequency [11]. Based on these parameters, Fig. 13.11a depicts the average optical latencies of the control and data phases for varying values of α. Simulation results for the average total latency of data transmission through the proposed optical architecture, computed as the sum of the control and data transmission latencies, are also shown in Fig. 13.11a. As depicted in this figure, for all traffic conditions, the total latency for the
Fig. 13.11 (a) Delay values of 2D-HERT, (b) waiting interval as the percentage of total delay
data transmission is dominated by the optical data routing through the waveguides. Specifically, under light traffic conditions, the latency of the control phase is negligible compared to the total latency, and this ratio remains small even under heavy traffic conditions.
Regarding the ORG architecture, in the case of a busy destination, optical data transmission is postponed and the data packet is buffered at the source node. The data queuing latency at the transmitter side, referred to as the transmitter waiting interval, impacts the average latency of the optical control phase and hence the total performance of the 2D-HERT ONoC. Figure 13.11b shows the average value of the transmitter waiting interval as a percentage of the total latency for varying values of α. As shown in this figure, for low traffic loads the waiting interval is negligible, and its impact remains tolerable for high traffic loads. The latter property stems from the fact that the ORG architecture takes advantage of small optical control packets transmitted through the optical waveguides without imposing resource reservation at the intermediate nodes. Moreover, for receiving optical packets at the destination node, the ejection channel of the destination optical switch is occupied only for a short period of time, owing to the ultra-high bandwidth of the optical waveguides.
Total power consumption for transmitting an optical message through the 2D-HERT
architecture is computed as the sum of optical and electrical losses for both the
data and control packet transmissions. While optical power is dissipated in the
With an off-chip light source, the input laser power is constant and determined by the worst-case optical loss in the network. The optical power loss for a single-wavelength data transmission through a source-destination path is obtained by summing the individual loss contributions along the path:

$$P_{Loss} = N_{on} P_{MR,DP} + N_{off} P_{MR,TP} + P_W \left( L_l \, HopCount_l + L_s \, HopCount_s \right) + P_B N_B + P_{IL,WC} N_{WC} + P_{CR} \tag{13.7}$$
where N_on and N_off represent the numbers of resonators, passed by the optical message, in the ON and OFF states, respectively; P_MR,DP and P_MR,TP stand for the drop-port and through-port insertion losses of a passive microring, respectively; L_l and L_s are the inter- and intra-super-node optical link lengths, which are approximately 1.5 and 0.5 mm, respectively; HopCount_l and HopCount_s are the numbers of long hops and short hops (intra-super-node hops), respectively, passed by the optical message; P_B is the waveguide bending loss; P_W is the waveguide propagation loss per unit distance; P_IL,WC is the waveguide crossing insertion loss; N_B and N_WC stand for the numbers of waveguide bendings and crossings; and P_CR is the coupling loss from the waveguide to the optical receiver. For a bit error rate of 10^{-15}, the minimum power required by the receiver is −22.3 dBm [28]. Hence, the power required for a multi-wavelength data transmission equals:
multi-wavelength data transmission equals:
where PT and PR represent the power consumed by the transmitter and receiver
circuits, respectively, at the 22 nm technology. Table 13.3 shows the values of these
parameters in the network, which are extracted from the literature [11, 12, 29–31].
In addition to power dissipation in the data plane, control packet transmission
through the optical control plane imposes additional optical losses. Based on
the ORG architecture, a control packet transmitted through an optical control
waveguide is detected and received by all control units located on the corresponding
row. Therefore, the optical power loss for a single-wavelength control packet
transmission through the control plane is computed as follows:
where L_CW represents the length of the optical control waveguide, P_Splitter is the power loss per split, N/4 represents the number of control units located on a control waveguide, and the remaining parameters are defined in the same way as those in Eq. 13.7. After accounting for the receiver sensitivity, the minimum power required for multi-wavelength control packet transmission is computed as:
When the optical and electrical losses for both the data and control packet transmissions are taken into account, the total power consumption for transmitting an optical message through the all-optical 2D-HERT architecture is computed from Eq. 13.6.
Finally, ring filters and modulators have to be thermally tuned to maintain their resonance wavelengths under on-die temperature variations. Monolithic integration gives the most optimistic ring-heating efficiency of all approaches (due to in-plane heaters and air undercut), estimated at 1 μW per ring per K [18]. To calculate the required power for thermal tuning, we assume that, under typical conditions, the rings in the system experience a temperature range of 20 K, which leads to a thermal tuning power of 20 μW per ring.
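This thermal-tuning budget is simple enough to state in one line (values as assumed above):

```python
def thermal_tuning_uw(n_rings, temp_range_k=20.0, uw_per_ring_k=1.0):
    # 1 uW per ring per kelvin (monolithic integration) over an assumed
    # 20 K on-die temperature range -> 20 uW per ring
    return n_rings * uw_per_ring_k * temp_range_k
```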
In this section, we compare the worst-case on-chip power loss of the ORG and Corona token-based arbitration architectures. For this purpose, the maximum insertion loss for a source-destination path is calculated in each of these architectures using the power parameters from Table 13.3. This loss, which is traffic-independent, equals 8 dB and 9.96 dB in the request-grant and token-based architectures, respectively. In other words, the proposed control architecture reduces the on-chip optical loss by 37% compared to the Corona control plane. In the ORG architecture, the power reduction mainly stems from the reduced number of off-resonance microring resonators passed by each control packet. Specifically, in the token-based architecture, an optical token conveys the right to send data to a specific destination node. Hence, each token has to be continuously routed through all optical nodes to enable probable data transmission from a source node to the corresponding destination node. In the request-grant control scheme, however, optical control packets are generated and transmitted only prior to data transmission, and they are not required to pass through all optical nodes. For example, in the ORG architecture shown in Fig. 13.8, each control packet passes through only the four CUs located on the corresponding control waveguide.
Utilizing the extracted power models, the 2D-HERT simulator computes the power consumption for each optical packet received at the destination node. The maximum value over different source-destination pairs is taken as the required input laser power. Figure 13.12 shows the power consumption for data and control packet transmission through the all-optical 2D-HERT for varying values of α. Simulation results for the total power dissipated for data transmission through the proposed optical architecture are also shown in Fig. 13.12. As shown in this figure, routing control packets optically through the network consumes approximately 68% and 76% of the total power dissipated for optical data transmission in the case of low
and high offered loads, respectively. Therefore, we conclude that power consumed
in optical and electro-optical devices for routing control packets dominates total
power consumption of the network, which emphasizes the importance of control
design optimization in all-optical NoC architectures.
Figure 13.13 depicts the average energy dissipated for control and data packet transmission through the 2D-HERT ONoC for varying values of α. As depicted in this figure, although the power consumption of the control architecture dominates that of the data network, the total energy of the proposed optical network is dominated by the energy dissipated for data packet, rather than control packet, transmission. The latter property stems from the small latency of control packet transmission, which reduces the total energy dissipated in the proposed control architecture.
This section analyzes the system-level metrics of the 2D-HERT optical NoC compared to those of an ENoC, built upon the same topology, under various workloads. Based on the predictions made by Shacham et al. [5], we assume 168-bit flits for data transmission between adjacent routers in the ENoC under a 5 GHz clock frequency. The router processing delay is assumed to be 600 ps [5]. Moreover, we assume a propagation velocity of 131 ps/mm in an optimally repeated wire at 22 nm technology [11]. The estimated power consumption per unit length for electrical wires is about 1 mW/mm [32]. Shacham et al. [5] have reported energy
Fig. 13.15 Traffic analysis: (a) Delay_ENoC / Delay_2D-HERT, (b) Power_ENoC / Power_2D-HERT
Fig. 13.16 Normalized system-level metrics (with respect to 2D-HERT) for various synthetic
workloads (a) delay, (b) power, (c) energy
In the token-based Corona scheme, tokens circulate continuously on a dedicated arbitration waveguide. If a node can grab a token, it absorbs the token, transmits the packet, and then releases the token to allow other nodes to obtain it.
The Firefly architecture [9] is implemented as multiple smaller crossbars and avoids global arbitration by using localized electrical arbitration performed among a smaller number of ports. Instead of using multi-write optical buses, the Firefly topology uses multi-read optical buses assisted by broadcast communication for path reservation and channel arbitration. In the Phastlane architecture [12], a hybrid optical/electrical network, routers utilize electrical buffers to resolve optical contention at the intermediate routers; when the buffers are fully occupied, optical packets are dropped and a high-speed drop signal is sent back to the transmitter.
Figure 13.16a–c depict the data transmission delay, power consumption, and energy dissipation of the various architectures, normalized to those of the 2D-HERT optical NoC. As depicted in these figures, 2D-HERT is in general the most efficient optical architecture under various traffic patterns, while ETorus exhibits the worst power and delay metrics among the compared architectures. In the following, we discuss the simulation results in detail.
Unlike Phastlane [12], the 2D-HERT architecture eliminates the need for resource
reservation at the intermediate nodes, and hence, reduces the arbitration overhead.
Moreover, optical messages are neither electrically buffered nor dropped at the
intermediate routers. These architectural advantages improve total performance of
the proposed architecture over Phastlane. On the other hand, compared to Corona
[10] and Firefly [9], 2D-HERT reduces data transmission delay due to the smaller
network diameter. In all, averaged across different traffic patterns, 2D-HERT ONoC
reduces data transmission delay by 24%, 15%, 18%, 4%, and 70% over Phastlane,
Firefly, Corona, λ-router, and electrical Torus, respectively. However, it is worth
noting that under hotspot traffic pattern, λ-router reduces average packet delay by
11% against the proposed architecture according to its arbitration-free architecture.
As discussed before, the large number of microring resonators results in high power consumption for data transmission through the λ-router. In Phastlane, the electrical control network, along with the electrical buffering of optical packets in the case of contention, considerably increases the worst-case power consumption of this architecture compared to 2D-HERT.
In the Corona crossbar, the high insertion loss of the optical crossbars, along with the global arbitration phase and the large number of off-resonance microrings passed by an optical packet, increases the worst-case per-packet power consumption in the network. With the assumption of a single-cycle router,
Firefly reduces per-packet power consumption by approximately 5% over Corona
crossbar topology due to its localized arbitration scheme. However, Firefly uses
local electrical meshes, which increase the total power consumption due to the
associated router and electrical link traversal power. This impact, along with the
reservation broadcasting in the Firefly architecture, results in a higher worst-case power consumption compared to the all-optical 2D-HERT architecture.
In all, the 2D-HERT architecture achieves an average per-packet power reduction of 58%, 47%, 52%, 45%, and 95% over Phastlane, Firefly, Corona, λ-router, and the electrical torus, respectively, across the various traffic patterns. Finally, the simulation results in Fig. 13.16c show that 2D-HERT attains 68%, 55%, 61%, 46%, and 98% lower per-packet energy consumption than Phastlane, Firefly, Corona, λ-router, and the electrical torus, respectively.
Evaluating the impact of traffic distribution on the efficiency of the 2D-HERT architecture, Fig. 13.16a–c confirm that when locality is introduced in the traffic, the proposed architecture achieves much lower per-packet delay, power, and energy consumption, owing to the smaller waiting interval under local or permutation traffic patterns. While in Phastlane optical contention at any node on the source-destination path prevents optical data transmission, in 2D-HERT only end-point contention postpones it.
13.7.1 Laser
The main challenge for silicon photonics is growing the laser on a silicon chip,
because silicon is a poor laser material [35]. In February 2011, however, researchers
at the University of California announced that they had overcome this problem by
taking advantage of the properties of nanostructures and by carefully controlling
the growth process [36]. This was the first time that lasers had been grown from high-performance materials directly on silicon, a breakthrough that paves the way for integrating optical components on silicon chips.
13.7.2 Layout
Prior work has advocated both same-die and separate-die integration of optical
components [17, 37, 38]. Monolithic integration has less interfacing overhead and
higher yield than 3D stacking, but requires the optical components, which are
relatively large, to consume active die area. 3D stacking, on the other hand, follows
trends of future interconnects occupying separate layers and allows the CMOS and
photonic processes to be independently optimized. The optical layer need not have
any transistors, consisting only of patterning the waveguides and rings, diffusion
to create the junctions for the modulators, germanium for the detectors, and a
metallization layer to provide contacts between layers.
13.7.3 Area
Unlike electrical devices, optical devices do not scale readily with the technology node, owing to the constraint imposed by the wavelength of light. For instance, while transistor sizes are largely determined by the technology scale, ring-resonator dimensions are largely determined by the coupling wavelength. Therefore, compact photonic switching elements are essential for building an optical on-chip network in future MPSoC designs. Ring sizes may shrink as the technology improves, but the gains are limited; estimates indicate that rings start losing effectiveness at radii below 1.5 μm [39]. Hence, it is unclear how hundreds of resonator-based photonic switches can be integrated on a single chip without considerable area overhead.
References
1. L. Benini, G. De Micheli, Networks on chips: A new SoC paradigm. IEEE Comp. 35(1), 70–80
(2002)
2. A. Shacham, K. Bergman, L.P. Carloni, Maximizing GFLOPS-per-Watt: High-bandwidth, low
power photonic on-chip networks, in P = ac2 Conference (New York, 2006), pp. 12–21
3. K.C. Saraswat, F. Mohammadi, Effect of scaling of interconnections on the time delay of VLSI
circuits. IEEE Trans. Electron. Dev. ED-29, 645–650 (1982)
4. D. Miller, Rationale and challenges for optical interconnects to electronic chips. Proc. IEEE.
88(6), 728–749 (2000)
5. A. Shacham, K. Bergman, L.P. Carloni, Photonic networks-on-chip for future generations of
chip multi-processors. IEEE Trans. Comput. 57, 1–15 (2008)
6. F. Adam, R. Gutiérrez-Castrejón, I. Tomkos, B. Hallock, R. Vodhanel, A. Coombe, W. Yuen,
R. Moreland, B. Garrett, C. Duvall, C. Chang-Hasnain, Transmission performance of a 1.5-μm
2.5-Gb/s directly modulated tunable VCSEL. Photonics Technol. Lett. 15(4), 599–601 (2003)
7. C. Guillemot, M. Renaud, P. Gambini, C. Janz, I. Andonovic, R. Bauknecht, B. Bostica, M.
Burzio, F. Callegati, M. Casoni, D. Chiaroni, F. Clerot, S.L. Danielsen, F. Dorgeuille, A. Dupas,
A. Franzen, P.B. Hansen, D.K. Hunter, A. Kloch, R. Krahenbuhl, B. Lavigne, A. Le Corre, C.
Raffaelli, M. Schilling, J.-C. Simon, L. Zucchelli, Transparent optical packet switching: The
European ACTS KEOPS project approach. J. Lightwave Technol. 16(12), 2117–2134 (1998)
8. N. Kirman, M. Kirman, R.K. Dokania, J.F. Martinez, A.B. Apsel, M.A. Watkins, D.H. Al-
bonesi, Leveraging optical technology in future bus-based chip multiprocessors, in IEEE/ACM
Annual International Symposium on Microarchitecture (Florida, USA, 2006), pp. 492–503
29. G. Chen, H. Chen, M. Haurylau, N. Nelson, P.M. Fauchet, E.G. Friedman, D.H. Albonesi,
Predictions of CMOS compatible on-chip optical interconnect. VLSI J. Integr. 40(4), 434–446
(2007)
30. M. Lipson, Guiding, modulating, and emitting light on silicon-challenges and opportunities. J.
Lightwave Technol. 23(12), 4222 (2005)
31. D.M. Vantrease, Optical tokens in many-core processors, University of Wisconsin (2010)
32. M. Haurylau, C.Q. Chen, H. Chen, J.D. Zhang, N.A. Nelson, D.H. Albonesi, E.G. Friedman,
P.M. Fauchet, On-chip optical interconnect roadmap: Challenges and critical directions. IEEE
J. Sel. Top. Quant. Electron. 12(6), 1699–1705 (2006)
33. K. Greene, A record-breaking optical chip. MIT Technol. Rev. Available online at http://www.technologyreview.com/Infotech/21005/?a=f
34. L. Wosinski, W. Zhechao, Integrated silicon nanophotonics: A solution for computer in-
terconnects, in IEEE International Conference on Transparent Optical Networks (ICTON)
(Stockholm, Sweden, 2011), pp. 1–4
35. ITU-T Technology Watch Report, The optical world, June 2011, Available online at http://
www.itu.int/ITU-T/techwatch
36. K. Bourzac, Laser-Quick Data Transfer, MIT Technol. Rev. Available online at http://www.
technologyreview.com/computing/32324/
37. B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. Loh, D. McCaule, P. Morrow,
D. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, C. Webb, Die stacking (3D)
microarchitecture, in IEEE/ACM International Symposium on Micro-architecture (Florida,
USA, 2006), pp. 469–479
38. C. Batten, A. Joshi, J. Orcutt, A. Khilo, B. Moss, C. Holzwarth, M. Popovic, H. Li, H. Smith, J. Hoyt, F. Kartner, R. Ram, V. Stojanovic, K. Asanovic, Building manycore processor-to-DRAM
networks with monolithic silicon photonics, in IEEE/ACM International Symposium on Micro-
architecture (New York, USA, 2009), pp. 8–21
39. Q. Xu, D. Fattal, R. Beausoleil, Silicon microring resonators with 1.5 μm radius. Opt. Express 16(6), 4309–4315 (2008)
40. B. Guha, B. Kyotoku, M. Lipson, CMOS-compatible athermal silicon microring resonators.
Opt. Express 18, 3487–3493 (2010)
41. J. Ahn, M. Fiorentino, R.G. Beausoleil, N. Binkert, A. Davis, D. Fattal, N.P. Jouppi, M.
McLaren, C.M. Santori, R.S. Schreiber, S.M. Spillane, D. Vantrease, Q. Xu, Devices and
architectures for photonic chip-scale integration. Appl. Phys. Mater. Sci. Process. 95(4),
989–997 (2009)
42. C. Nitta, M. Farrens, V. Akella, Addressing system-level trimming issues in on-chip
nanophotonic networks, in IEEE International Symposium on High Performance Computer
Architecture (HPCA) (San Antonio, Texas, USA, 2011), pp. 122–131
43. Y. Pan, J. Kim, G. Memik, FlexiShare: Channel sharing for an energy-efficient nanophotonic
crossbar, in IEEE International Symposium on High Performance Computer Architecture
(HPCA) (Bangalore, India, 2010), pp. 1–12
Part VI
Industrial Case Study
Chapter 14
On Chip Network Routing for Tera-Scale
Architectures
This chapter includes material adapted from our earlier publications [1–3].
A.S. Vaidya
Nvidia Corporation, 2701 San Tomas Expy, Santa Clara, CA 95050, USA
e-mail: aniv@nvidia.com
M. Azimi • A. Kumar
Intel Corporation, 2200 Mission College Blvd, Santa Clara, CA 95052, USA
e-mail: mani.azimi@gmail.com; akhilesh.kumar@intel.com
14.1 Introduction
Designing processors with many cores has been widely accepted in the industry as
the primary approach for delivering ever increasing performance under hard power
and area constraints. General purpose processors already have several tens of cores
and can be expected to increase to a few hundred in this decade. For example, the Intel® Xeon Phi™ coprocessor code-named Knights Corner [6], which became available in late 2012, has 60 cores and delivers up to 1 teraFLOPS of double-precision peak performance [18]. Such tera-scale processors provide a platform for
use across a wide array of application domains, taking advantage of increasing
device densities offered by Moore’s law.
A typical implementation of such a processor includes tens of general-purpose
cores today (and possibly a hundred plus cores in the near future), multiple
levels of cache memory hierarchy to mitigate memory latency and bandwidth
bottlenecks, and interfaces to off-chip memory and I/O devices. Most many-core
chips use various building blocks connected through an on-chip interconnect to
realize specific products. Scaling this architecture to future process generations
requires a flexible, capable and optimized on-chip interconnect. In this chapter,
we detail the technology development, research and prototyping environments for
high-performance on-chip interconnects with application to scalable server and
high-performance compute processor architectures.
On-chip interconnects can take advantage of abundant wires, smaller clock
synchronization overhead, lower error rate, and lower power dissipation in links
compared to off-chip networks, where chip pin counts and power dissipation in the
transceivers and links dominate design considerations. However, efficient mapping
to a planar substrate places restrictions on suitable topology choices [10]. Fur-
thermore, a need to support performance isolation, aggressive power-performance
management using dynamic voltage-frequency scaling (DVFS) techniques, and
handling within-die process variation effectively places additional requirements on
topology selection at design time [11].
In addition to physical design considerations that affect topology choices as
above, there are workload considerations. A general-purpose many-core processor
must efficiently run diverse workloads spanning legacy and emerging applications
from domains as varied as scientific computing, transaction processing, visual
computing, and cloud computing. Such workloads may exhibit communication
characteristics with transitory hot spots, jitter, and congestion among various
functional blocks. Therefore, it is an imperative design requirement for on-chip
interconnects to respond to these conditions gracefully.
There are several plausible approaches for designing the on-chip interconnect
for a many-core tera-scale processor chip. Our work has focused on a flexible
interconnect architecture based on two dimensional mesh and torus topologies. This
architecture is further augmented by a rich set of routing algorithms for supporting
various features. These architecture details and our earlier motivations have been
documented in [1–3]. In this chapter, we primarily focus on the routing algorithms.
numbers of lower-performance processors are densely packaged together to scale out workloads that do not need shared address spaces. We expect such approaches to be adapted to exploit the energy and cost efficiencies provided by integrating a large number of cores on a single die. An environment to support this should allow dynamic allocation and management of compute, memory, and IO resources, with as much isolation between different partitions as possible. A long sequence of resource allocations and de-allocations can create fragmentation, leaving no clean and regular boundary between the resources allocated for different purposes. The interconnection network bridging these resources should be flexible enough to allow such partitioning with high quality-of-service (QoS) guarantees and without causing undue interference between different partitions.
Cost and yield constraints for products with large numbers of cores may create a requirement for masking manufacturing defects or in-field failures of on-die components, which in turn may result in configurations that deviate from the ideal topology of the on-chip interconnect. Another usage scenario that can create configurations
that are less than ideal is an aggressive power-management strategy where certain
segments of a chip are powered-down during periods of low utilization. Such
scenarios can be enabled only when the interconnect is capable of handling irregular
topologies with graceful performance degradation.
Fig. 14.1 Examples of 2D mesh and torus topology variants supported. (a) 2D mesh. (b) 2D torus.
(c) Mesh-torus. (d) Concentrated mesh-torus
virtual channels to allow different types of traffic to share the physical wires. Some
of the relevant high-level architectural features of the interconnect are discussed in
Sect. 14.3. A more detailed discussion of these aspects is available in [2, 3].
Figure 14.1 depicts four on-chip interconnect topology options optimized for a CMP
with the tiled modular design paradigm. All options are variants of 2-dimensional
(2D) mesh or torus topologies. In option (a), each processor tile is connected
to a 5-port router that is connected to the neighboring routers in both X and Y
dimensions. IO agents and memory interfaces can be connected to the local port
of the router or directly to an unused router port of a neighboring tile on the
periphery of the chip. Option (b) depicts a 2D folded torus with links connecting
routers in alternate rows and/or columns to balance wire delays. Compared to option
(a), the average hop count per packet is reduced at the expense of longer wires, and
the number of wires in the wiring channels is doubled for the same link width.
In this particular example, more routers and links are needed to connect
the peripheral devices. Option (c) shows a hybrid 2D mesh-torus topology with
wraparound links in X dimension only by exploiting the fact that all peripherals
are now located along the Y dimension. Compared to option (a), the same number
of routers are used but the number of wires is doubled in the X dimension.
Option (d) shows yet another variation of hybrid mesh-torus with two cores sharing
one router, resulting in a concentrated topology that requires fewer routers for
interconnecting the same number of cores. To enable this topology, either an extra
port is required in every router to accommodate the additional core in a tile, or the
two cores in a tile share a single local port through a mux and de-mux at the network
interface.
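All four variants can be described by a single neighbor function with per-dimension wraparound. The sketch below is our own illustration of this observation (with hypothetical names), not part of the actual design:

def neighbors(x, y, cols, rows, wrap_x=False, wrap_y=False):
    # Neighboring router coordinates for the variants of Fig. 14.1:
    # 2D mesh (no wrap), 2D torus (both wraps), mesh-torus (wrap_x only).
    result = []
    for nx in (x - 1, x + 1):
        if wrap_x:
            result.append((nx % cols, y))
        elif 0 <= nx < cols:
            result.append((nx, y))
    for ny in (y - 1, y + 1):
        if wrap_y:
            result.append((x, ny % rows))
        elif 0 <= ny < rows:
            result.append((x, ny))
    return result

print(neighbors(0, 0, 6, 6))                            # mesh corner: 2 neighbors
print(neighbors(0, 0, 6, 6, wrap_x=True))               # mesh-torus: 3 neighbors
print(neighbors(0, 0, 6, 6, wrap_x=True, wrap_y=True))  # torus: 4 neighbors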
The network topologies shown in Fig. 14.1 and many others can utilize the same
basic router design. The tradeoff between different network topologies depends on
the specific set of design goals and constraints of a given product. In other words, substantial topology flexibility can be achieved through minor design modifications.
In the next subsection, we discuss the micro-architecture and pipeline of one such
router in detail.
(a) Flit 0: Local Arbitration (LA) → Global Arbitration (GA) → Switch Traversal (ST) → Link Traversal (LT). (b) Flit 0: Global Arbitration (GA) → Switch Traversal (ST) → Link Traversal (LT).
Fig. 14.2 Overview of the router pipeline. (a) Router pipeline under heavy traffic load. (b) Router
pipeline under light traffic load
It should be noted that the LT stage is the boundary between a router and a link and is not considered a stage of the router pipeline per se. The bulk of the LA stage operations (except some bookkeeping) is skipped entirely when a flit arrives at a previously idle input port with no queued-up flits, reducing the router latency under light traffic load to just two cycles. In such a case, the flit (or packet) proceeds directly to the GA stage, as it is the only candidate from that input port. The pipeline is reconfigured automatically according to the traffic conditions. Figure 14.2 captures the router pipeline stages under heavy and light traffic conditions.
Figure 14.3 shows the key functions performed in each stage. Of most relevance
to us here is the route pre-compute functionality that is implemented in the GA or
switch arbitration stage. We will discuss route computation in a later part of this
subsection.
Our router architecture relies on virtual channel flow control [8] both to improve
performance and to enable support for deadlock-free routing with various flavors of
deterministic, fault-tolerant and adaptive routing. The set of virtual channels (VCs)
is flexibly partitioned into two logical sets: routing VCs and performance VCs. VCs
are also logically grouped into virtual networks (VNs). VCs belonging to the same
VN are used for message-class (MC) separation required by protocol-level MCs,
but they use the same routing discipline. Routing VCs are also used for satisfying
deadlock-freedom requirements of particular routing algorithms employed. Each
routing VC is associated with one and only one MC with at least one reserved credit.
Performance VCs belong to a common shared pool of VCs; they can be used by any MC at a given time, under both adaptive and deterministic routing schemes. A VC is used by only one message at any given time, to manage design complexity and to ensure deadlock freedom in support of fully adaptive routing based on Duato's theory [13].
Fig. 14.4 Example mapping of virtual networks to virtual channels for 2D mesh and torus topologies. (a) 2D mesh. (b) 2D torus
The number of VCs supported in a design is a function of design-time goals and of area and power constraints. Figure 14.4 depicts example mappings of the supported VCs onto routing VCs, belonging to specific message classes and VNs, required to support minimal deadlock-free XY routing for 2D mesh and torus topologies, together with the pool of performance VCs. The example configurations assume a total of 12 VCs and 4 MCs; the mesh requires a single VN (VN0) for deadlock freedom, whereas the torus requires two VNs (VN0, VN1).
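The mapping of Fig. 14.4 can be expressed as a small configuration sketch. The code below is illustrative (the helper name and layout are our assumptions, not the actual design database): one routing VC is reserved per (MC, VN) pair, and the remaining VCs form the shared performance pool.

def vc_map(num_vcs=12, num_mcs=4, num_vns=1):
    # Reserve one routing VC per (MC, VN) pair; pool the rest as performance VCs.
    routing = {}
    vc = 0
    for vn in range(num_vns):
        for mc in range(num_mcs):
            routing[(mc, vn)] = vc
            vc += 1
    perf_pool = list(range(vc, num_vcs))
    return routing, perf_pool

mesh_routing, mesh_perf = vc_map(num_vns=1)    # mesh: VCs 0-3 routing, 4-11 perf
torus_routing, torus_perf = vc_map(num_vns=2)  # torus: VCs 0-7 routing, 8-11 perf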
A single shared buffer at each input port [19] is used to support flexibility and optimal usage of packet-buffering resources with respect to performance, power, and area. The buffer is shared by all VCs at a port, whether routing or performance VCs. Buffer slots are dynamically assigned to active VCs, and linked lists track the flits belonging to a given packet associated with a VC. A free-buffer list tracks the slots available for allocation to incoming flits.
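A minimal sketch of such a dynamically allocated multi-queue buffer (in the spirit of [19]; the class and method names are ours) might look as follows:

class SharedInputBuffer:
    # One physical buffer per input port, shared by all VCs at that port.
    def __init__(self, num_slots):
        self.free = list(range(num_slots))  # free-buffer list
        self.queues = {}                    # vc -> ordered slot indices (the "linked list")
        self.slots = [None] * num_slots

    def enqueue(self, vc, flit):
        # Credit-based flow control guarantees a free slot exists on arrival.
        slot = self.free.pop()
        self.slots[slot] = flit
        self.queues.setdefault(vc, []).append(slot)

    def dequeue(self, vc):
        slot = self.queues[vc].pop(0)
        flit, self.slots[slot] = self.slots[slot], None
        self.free.append(slot)              # slot returns to the free list
        return flit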
Fig. 14.5 Route pre-compute (Cn is north port connection bit; Rne and Rnw are turn-restriction
bits for northeast and northwest turn, respectively)
The router uses credit-based flow control to manage the downstream buffering resources optimally. It tracks the available input-buffer space of the downstream router at the output side of the crossbar, through handshaking between two parts: upstream credit management and downstream credit management.
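The upstream side of this handshake reduces to a counter per VC (or per port); the following sketch is a generic rendering of credit-based flow control, not the specific RTL:

class CreditCounter:
    def __init__(self, downstream_slots):
        self.credits = downstream_slots  # one credit per downstream buffer slot

    def can_send(self):
        return self.credits > 0

    def on_send(self):
        # Upstream consumes a credit when a flit crosses the link.
        assert self.credits > 0
        self.credits -= 1

    def on_credit_return(self):
        # Downstream freed a slot and returned a credit.
        self.credits += 1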
Routing determines the path a packet takes to reach its destination. We use a distributed routing scheme, in which each router selects the output port a packet must take to move towards its destination. For minimal adaptive routing, up to two distinct directions may be permitted, depending on the region in which the destination node falls. The routing decision (i.e., the output ports and VN choices permitted) at each router is based on the current input port and VN of the packet, as well as on the destination address of the packet. Two different options are supported in our design, a compressed table-based distributed routing (TBDR) [14, 20] and logic-based distributed routing (LBDR) [15], in order to enable a wide set of routing algorithms to be implemented efficiently. Compared to the LBDR scheme, the TBDR scheme uses a 9-entry table per router, providing more routing flexibility at a higher storage overhead.
The LBDR scheme uses a connection bit per output port and two turn-restriction bits. Different routing algorithms can be supported by setting the appropriate turn restrictions. Our router architecture uses route pre-computation [10, 16] for the route decision of the neighboring routers, thereby removing route computation from the critical path of the router pipeline. As shown in Fig. 14.5, it can be divided into two steps: (1) compute route tags based on the packet destination, identifying the target quadrant, and (2) determine the output port based on the selected routing algorithm. Under an adaptive routing scheme, this can imply that the packet has a choice of more than one output port towards its destination; we support up to two output-port choices.
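As an illustration, the sketch below renders the two steps in Python under our own conventions (positive dy meaning the destination lies to the north, and a set restriction bit meaning the corresponding turn is forbidden, following the chapter's description); it is not the actual gate-level logic of [15]:

def lbdr_outputs(dx, dy, C, R):
    # Step 1: route tags (target quadrant) from the destination offset.
    north, south = dy > 0, dy < 0
    east, west = dx > 0, dx < 0
    # Step 2: candidate output ports, filtered by connectivity bits (C)
    # and the turn-restriction bits (R) of the selected routing algorithm.
    ports = []
    if north and C['N'] and not (east and R['ne']) and not (west and R['nw']):
        ports.append('N')
    if south and C['S'] and not (east and R['se']) and not (west and R['sw']):
        ports.append('S')
    if east and C['E'] and not (north and R['en']) and not (south and R['es']):
        ports.append('E')
    if west and C['W'] and not (north and R['wn']) and not (south and R['ws']):
        ports.append('W')
    return ports[:2]  # the design supports up to two output-port choices

For example, under this encoding, setting the ne, nw, se, and sw restriction bits while clearing the others leaves only the horizontal port for any diagonal quadrant, which yields deterministic XY routing.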
In this section we describe the support for various routing algorithms to enable
a flexible, configurable, and adaptive interconnect, and we discuss the design
implications.
The router architecture supports distributed routing, wherein subsets of the routing decisions are made at each router along the path taken by a given packet. In two-dimensional networks such as the mesh and torus, given a source node, the set of shortest paths to the destination falls into one of four quadrants. With the LBDR framework [15], any turn-model-based routing algorithm [17], such as XY, west-first, odd-even [5], etc., can be implemented by setting the turn-restriction bits appropriately. For minimal adaptive routing, up to two distinct directions may be permitted, depending on the quadrant in which the destination node falls. The routing decision (i.e., the output ports and VN choices permitted) at each router is based on the current input port and VN of the packet, as well as on the desired destination. For each VN, we support these flexible algorithms with a very economical storage of only a few bits per port, or alternatively with a small 9-entry table [2].
Minimal-path deterministic routing in mesh and torus topologies is supported, as well as partially and fully adaptive minimal-path routing algorithms such as those based on the turn model [17]. Our adaptive router architecture uses Duato's theory [13] to
reduce the VC resource requirements while providing full adaptivity. Table 14.1
shows a comparison of the minimum number of VCs required to implement
deadlock-free fully adaptive routing using the turn model versus one based on
Duato’s theory.
TBDR routing support also enables a deterministic fault-tolerant routing
algorithm based on fault-region marking and fault-avoidance, such as in [4], as
well as adaptive fault-tolerant routing algorithms [12]. Incomplete or irregular
topologies caused by partial shutdown of the interconnect because of power-
performance tradeoffs can be treated in a manner similar to a network with faults
for routing re-configuration purposes.
Pole routing is a novel two-stage routing algorithm for 2D meshes that supports regular, irregular, and faulty mesh networks. In the first stage, a message is routed to a predetermined intermediate destination called the pole node. In the second stage, the message is forwarded from the pole node to the final destination node. Each routing stage uses a minimal deadlock-free routing algorithm in a mesh. However, the full route from source to destination may or may not be minimal, depending on the location of the pole with respect to the source-destination pair. The selection of the pole node is made by the source node. Pole routing can thus be viewed as a hybrid source-controlled distributed routing algorithm and can be combined with table-based or hardwired routing implemented in the fabric. The source can exercise control over load balancing, or improve routability around faulty nodes, by an appropriate choice of pole location (pole placement) for a desired destination. We use a minimal deterministic or partially adaptive routing algorithm based on the turn model as the baseline routing algorithm for the "pre-pole" and "post-pole" stages. Each of these stages requires just a single virtual channel (or virtual network) to guarantee deadlock freedom. However, to ensure that no cyclic channel dependence can arise between the pre-pole and post-pole phases, an additional virtual channel (virtual network) is required, because the pre-pole to post-pole transition of channels by a message (at the pole router) may have to use a turn that is disallowed in the baseline routing algorithm.
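A behavioral sketch of the two stages, assuming minimal XY routing as the baseline in both (the function names are illustrative):

def xy_route(src, dst):
    # Deterministic XY path on a mesh: resolve X first, then Y.
    (x, y), (tx, ty) = src, dst
    path = []
    while x != tx:
        x += 1 if tx > x else -1
        path.append((x, y))
    while y != ty:
        y += 1 if ty > y else -1
        path.append((x, y))
    return path

def pole_route(src, pole, dst):
    # Stage 1 (pre-pole) runs on one virtual network and stage 2 (post-pole)
    # on another, so the turn taken at the pole cannot close a dependency cycle.
    return xy_route(src, pole) + xy_route(pole, dst)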
Next we discuss a few different scenarios in which pole routing can be useful.
Load balancing with pole routing can be done in several ways. Here we discuss static approaches (i.e., we do not dynamically sense traffic conditions and then determine the appropriate pole placement). One simple and practical static approach is to consider a small number of pole-placement options at a given source, say up to two per destination, and to cycle between them. When the traffic pattern is static, such an approach can improve link utilization in the network and thereby raise network throughput.
An example is shown for transpose traffic on a 2D mesh network in Fig. 14.6. In transpose traffic, a node S with coordinates (X, Y) sends messages only to the destination node D with coordinates (Y, X). With deterministic XY routing in the network, the links near the bottom-left and top-right corners of the network are the most congested, which limits the peak throughput. This can be seen in Fig. 14.6a.
Fig. 14.6 Link utilization on a 6 × 6 mesh network with transpose traffic. (a) The deterministic XY routing algorithm causes congested (red) links at the corners of the mesh. (b) Load-balanced pole routing with appropriate pole placement reduces the utilization of the formerly congested links
One option to achieve better load balance for transpose traffic is for each source to send messages to D alternately via the pole locations P0 = (X, X) and P1 = (Y, Y). While two virtual networks are required for pole-routing support compared to the baseline case, both are still assumed to use XY routing; this specific choice of pole locations effectively routes messages along a YX path (via pole P0) or an XY path (via pole P1). The value of this approach can be seen in Fig. 14.6b, where the link utilizations are much lower at the same load than with deterministic XY routing.
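In code, the alternation amounts to a one-line pole selector (a sketch, reusing the pole_route helper above):

def transpose_pole(src, msg_idx):
    # For S = (X, Y) sending to D = (Y, X), alternate between P0 = (X, X),
    # which yields an effectively YX path, and P1 = (Y, Y), an XY path.
    x, y = src
    return (x, x) if msg_idx % 2 == 0 else (y, y)

print(pole_route((1, 4), transpose_pole((1, 4), 0), (4, 1)))  # via P0 = (1, 1)
print(pole_route((1, 4), transpose_pole((1, 4), 1), (4, 1)))  # via P1 = (4, 4)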
Clearly, pole-placement choices other than the one discussed above are available for improving network performance. Later in this section, we discuss optimal pole-placement heuristics. While we do not discuss dynamic load balancing, such approaches could also be coupled with a pole-routing-based solution.
We noted in the previous subsections that, for routing a message between a given source-destination pair using pole routing, there are multiple choices of pole placement. For a specified traffic pattern it may be desirable to optimize routing performance by making suitable pole-placement choices, both in the fault-free case and in cases where faults are present.
Fig. 14.7 Examples showing pole-placement (P) for fault-avoidance between two different source
(S)-destination (D) pairs (a) (S = 2, D = 30, P = 29) (b) (S = 30, D = 8, P = 13). Nodes have
been relabeled to skip faulty nodes. XY routing is the baseline routing algorithm assumed for pole
routing
Here we present two heuristics to optimize pole placement for a given traffic
pattern:
Min-hops: the first heuristic attempts to minimize the overall network latency in terms of the average number of hops taken to route messages.
Max-throughput: the second heuristic attempts to maximize the overall throughput for the given traffic pattern.
We also assume for the following discussion that a deterministic routing algorithm
(such as XY routing) is used for both the pre-pole and post-pole routing virtual
networks.
Min-hops heuristic: In this heuristic, for all fault-free source-destination pairs that are valid for a given traffic pattern, a pole location is picked such that it provides the least utilized shortest fault-free path. The key steps of the heuristic are as follows:
1. Select a source-destination pair (S, D), from amongst all fault-free and valid
source destination pairs for the given traffic pattern, at random.
2. Determine all valid pole positions that route a message from S to D using a
minimal path (in a faulty network these should be minimal fault avoiding paths).
3. Pick a pole location P from amongst all candidates such that the choice of P minimizes the utilization of the maximally used link amongst the possible paths. If P is not unique, pick amongst the candidate pole locations at random.
4. Pick the next source-destination pair at random and repeat the process until all
source destination pairs have been processed and all applicable pole locations
have been chosen.
The above approach aims to minimize the average number of hops for a given traffic pattern while simultaneously trying to improve link utilization. In a network with faults present, the utilization of some links can be skewed quite adversely, and this approach helps only to a degree. Figure 14.8a shows this behavior for uniform random traffic in a mesh with multiple faults.
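A compact rendering of the min-hops procedure follows (a sketch under our own data-model assumptions: the candidate-pole and link-enumeration functions are supplied by the caller):

import random
from collections import defaultdict

def min_hops_pole_placement(pairs, poles_for, links_for):
    # pairs:     list of valid fault-free (src, dst) tuples for the pattern
    # poles_for: (src, dst) -> candidate poles on minimal fault-free paths
    # links_for: (src, pole, dst) -> links traversed by that pole route
    util = defaultdict(int)
    placement = {}
    random.shuffle(pairs)  # steps 1 and 4: process the pairs in random order
    for s, d in pairs:
        # Step 3: minimize the utilization of the hottest link on the path.
        # (Ties are broken by iteration order here; the heuristic picks randomly.)
        best = min(poles_for(s, d),
                   key=lambda p: max(util[l] for l in links_for(s, p, d)))
        placement[(s, d)] = best
        for l in links_for(s, best, d):
            util[l] += 1
    return placement, util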
Max-throughput heuristic: This heuristic attempts to balance network link utilization by making use of both minimal and non-minimal routes. As above, the source-destination pairs are chosen at random. However, pole locations are picked from amongst all routes allowed by the routing algorithm (both minimal and non-minimal). Specifically, a pole location is chosen such that the utilization of the maximally utilized link amongst all candidate paths is minimized. If multiple pole locations lead to the same lowest value for the maximally utilized link, then the pole location with the shorter path length (number of hops) is chosen. Remaining ties are broken by selecting a pole location at random from amongst the equal-weight choices.
This heuristic is good for obtaining graceful degradation of performance (network throughput) in the presence of faults. Link utilizations for the same set of faults can be significantly lower with the max-throughput heuristic than with the min-hops heuristic, as can be seen in Fig. 14.8b.
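Relative to the min-hops sketch, only the pole-selection rule changes; a hedged rendering:

import random

def max_throughput_choice(candidates, util, links_for, hops_for):
    # candidates may lie on non-minimal routes; pick the pole minimizing
    # the hottest link, break ties by hop count, then randomly.
    def key(pole):
        return (max(util[l] for l in links_for(pole)),
                hops_for(pole),
                random.random())
    return min(candidates, key=key)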
Fig. 14.8 Link utilization on a 6 × 6 mesh network with multiple faulty nodes and uniform random traffic. (a) Pole-placement heuristic minimizing the inter-node hop count. (b) Pole-placement heuristic maximizing the throughput for this traffic pattern; non-minimal routes may be chosen
Fig. 14.10 Algorithm sketch for performance isolation routing within a non-rectangular partition
Fig. 14.11 Performance isolation with non-rectangular and rectangular shaped partitions
We have developed a full-featured RTL implementation of the router in Verilog, both for robust validation of the micro-architecture and design and to conduct a detailed performance characterization of the interconnect [7]. The larger goal of the interconnect prototyping effort is to have a robust interconnect that can then be interfaced to several production-grade processor cores and their cache-coherence protocol engines.
Our FPGA-based emulation environment is highly configurable in various parameters, including the number of MCs, the performance and routing VCs, and the buffer sizes. Along with each router in the prototype, a network node also implements a network interface (NI) block for packet ingress and egress functionality, as well as a synthetic traffic generator. Uniform random, transpose, bit-complement, and hotspot traffic patterns are currently supported by the traffic generator, along with several additional configurable parameters controlling injection rates, MCs, and sizes. Various routing algorithms using programmable routing tables have been implemented, including basic XY routing, turn-model-based routing, load-balanced and adaptive routing, and fault-tolerant routing, with support for isolating multiple partitions and for mesh and torus topologies.
Table 14.2 summarizes the key features implemented in the RTL.
The emulator control and visualization software enables one to initialize the network and run multiple experiments with various micro-architectural parameters, routing-algorithm tables, and traffic patterns. The software environment enables each experiment to run for any given number of cycles, after which a large array of performance-counter values can be recorded. The registered values include the numbers of injected and ejected packets, packet latencies split by MC, buffer utilizations, bypass and arbitration success/failure rates per port, etc. A custom GUI for control and performance visualization is used to run experiments interactively and to graphically render performance data in real time for each experiment.
Cycle-accurate performance simulation played two distinct roles in our explorations: (a) validation of ideas through performance analysis; and (b) validation of functionality before committing solutions to RTL, and validation of the RTL emulation at a later stage. The first role is critical for understanding the complexity of packets flowing through the routers and experiencing the effects of the many policies embedded in the micro-architecture of the router. Even though the intuition of architects plays the key role in devising new router functionalities and micro-architectures, in most cases validating the idea in a cycle-by-cycle simulator steers the original intuition in new directions. In our effort we took two different approaches to modeling the network and routers.
• Flexible pipeline with a stage-to-stage abstraction API: limited capability of adjusting pipeline stages, such as combining stages into one.
• Event-driven simulator: every packet-instigated hardware operation at each stage of the pipeline schedules the next operation in the future.
The first approach enabled the representation of multiple router pipelines with the same base code through simple modifications of the input pipeline configuration. However, validating all possible pipeline configurations after a change to one of them was cumbersome. The second approach required forking the code for every pipeline of interest; it was easy to maintain for a specific pipeline but required reflecting new ideas in the other live branches of the model. The appropriate choice is a function of the overall framework of the studies, i.e., simultaneous analysis of many alternative pipelines versus detailed analysis of a few.
The second role is required because implementing an idea at a higher level of abstraction in the cycle-accurate performance simulator hashes out the details required for the RTL implementation. In addition, debugging the FPGA implementation of the RTL is a complex task and requires a reference point for comparison at every stage. Our performance simulator was made clock-level accurate so that it could serve as the reference point for this debugging. The format and visualization of a subset of the emulation/simulation output was made consistent across the two environments for ease of debugging, i.e., a consistent set of APIs for configuration and visualization. The simulator's speed advantage of multiple orders of magnitude was useful for testing ideas before committing changes to the RTL.
the configuration of the interconnect. The pattern of traffic in GPUs for various graphics workloads and for GPGPU compute is partly regular and can benefit from workload classification to identify worst-case scenarios. The interdependences of messages in the GPU traces were analyzed, and the traces were pre-processed so that these dependencies could be honored when replayed on the interconnect simulator.
Trace classification for servers was mainly based on the coherence traffic, in order to identify the primary nature of the interconnect traffic: second-level cache misses hitting the shared third-level cache, dominant miss traffic from the third-level cache diverted to the memory controllers, the IO traffic distribution, etc.
Due to the large number of agents in the GPU architecture, the classification there is more complex and has distinct phase characteristics. The structured computation and associated memory traffic of GPU workloads provided the opportunity to look across a vast number of workloads, identify the traffic as a mix of dominant flows among small subsets of agents, and characterize the structure of that mix at different phases of the computation. Selecting the worst traffic pattern from the perspective of interconnect flow operation can then be used to study a large selection of architecture and micro-architecture alternatives in the context of the interconnect topology, router design, and routing algorithms. Properly load balancing the dominant traffic in the worst-case scenarios, and then repeating the analysis, enables an effective interactive approach to topology and micro-architecture optimization, and it demands fast simulation speeds over a large selection of workloads. The capability of automating the classification and the selection of representative short compute phases across all traces of every workload category provides a tremendous opportunity to focus on interconnect design in an interactive mode.
Figure 14.12 shows one example GPU interconnection-network configuration (a torus in the horizontal dimension and a mesh in the vertical dimension) with the associated placement of GPU agents. Some agents, such as caches, have multiple interfaces connected to different routers to load balance the traffic through appropriate hashing schemes. Given the regularity of GPU architectures, this can be a subset of a larger network encompassing it, i.e., the design can be extended in the vertical dimension. This particular solution is a function of the number of agents, their maximum injection and ejection bandwidths, real-estate cost, power-dissipation considerations, floor-planning considerations, metal-layer availability, modularity considerations across generations, the business priority of workloads, etc.
Figure 14.13 shows a plot of the pair-wise communication traffic mix for a selection of traces from different workloads, e.g., OpenCL, DirectX, etc. The x and y axes represent the various agent IDs, and the z dimension represents the traffic between a particular pair of agents over the simulation time. Dominant flows can be easily identified in this graph, but the timing of the traffic is missing, i.e., this is a cumulative plot across the whole simulation period.
Figure 14.14a, b represent the same set of data as Fig. 14.13, but as a function of time. The x dimension represents time, and the stack of colors at each point on the horizontal axis represents the traffic mix within the window of time represented by that point. In this example, each point on the x axis averages over 50,000 simulation cycles. In general, one is interested in grouping regions that have a very similar mix of pair-wise communication traffic for a desired length of time, and in selecting the one(s) with the largest volume within each mix. Such a subsetting approach guarantees that the highest quantity of each distinct simultaneous traffic flow is analyzed by the simulator. Proper classification of the distinct mixes, and selection of the highest-volume traffic within each mix, ensures that the worst-case interconnect flows are analyzed.
Fig. 14.13 Example plot from the traffic visualization tool showing the pair-wise traffic mix amongst various GPU nodes in the fabric
The color bands painted vertically in the background of the plots represent the cluster classification of the pair-wise traffic. The mix of pair-wise traffic in each cluster is the same, but the overall magnitude of each member of a cluster can differ. By selecting a small number of traffic mixes under a pre-defined duration constraint, we can cover the simulation of all critical phases of all traces and reduce the simulation time by two to three orders of magnitude across thousands of traces.
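The clustering itself can be sketched as a greedy similarity pass over normalized per-window traffic-mix vectors (our own illustrative formulation; the production tool's exact method is not described here):

import numpy as np

def cluster_phases(window_matrix, sim_threshold=0.95):
    # Each row is the pair-wise traffic mix in one time window; windows
    # whose normalized mixes are nearly parallel share a cluster.
    norms = np.linalg.norm(window_matrix, axis=1, keepdims=True)
    unit = window_matrix / np.maximum(norms, 1e-12)
    labels, reps = [], []
    for v in unit:
        for c, r in enumerate(reps):
            if v @ r >= sim_threshold:
                labels.append(c)
                break
        else:
            reps.append(v)
            labels.append(len(reps) - 1)
    return np.array(labels)

def representative_windows(window_matrix, labels):
    # Per cluster, keep the window with the largest total traffic volume.
    vols = window_matrix.sum(axis=1)
    return {c: int(np.argmax(np.where(labels == c, vols, -1)))
            for c in set(labels.tolist())}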
The analysis of the representative phases requires additional support to expedite the interactive cycle of design decision, analysis, and improvement. We developed a sophisticated interactive tool to examine the simulation replay of the traces, analyze the queueing effects on the links, visually identify the hot spots, and trace link saturations directly to the agents responsible for the traffic flow on those links. The color representation of the traffic levels on the interconnect links quickly identifies potential hot spots for specific trace regions. Such trace
[Plot omitted: total flits delivered over time, stacked by pair-wise traffic mix, with clustering per short running window of time; x axis: time relative to trace]
Fig. 14.14 GPU stacked pair-wise traffic over time with clustering to identify similar traffic phases
subsets can be automatically identified by our tool when link utilizations pass preset
threshold levels. A tool view showing traffic analyses for multiple sets of traces
simultaneously is shown in Fig. 14.15.
Figure 14.16 shows the interactive aspects of the tool. For example, the left portions of the figure show the difference between the traffic injected into the network (upper-left plot) and the ejected traffic (lower-left plot). By overlaying the ejected-traffic line on the stacked colored injection values, one can identify potential queueing effects: a high level of queueing shows up as a deviation between the instantaneous injection and ejection. The user interface allows the architect to click on an interconnect link to see the originators of the traffic flows through that link and the percentage contribution of each. By covering all scenarios that lead to queueing effects, one can identify the changes required in agent placement, routing-algorithm tuning, etc. After reflecting such changes in the router performance model, the analysis can be repeated with minimal simulation time, thanks to the huge computation savings enabled by the clustering approach. The focus of the study is limited to the small trace segments representing the highest-level flows with distinct mixtures of pair-wise communication between agents. We found our tool to be extremely effective in practice; it shortened the interconnect design and optimization cycle enormously.
Fig. 14.15 GPU Traffic analysis tool with multi-trace view option helps identify traces which may
stress interconnect configuration
Fig. 14.16 Interactive traffic analysis and problem triage using our performance visualization tool
Many interesting interconnection networks and variations have been explored over the last few decades in academic circles and publications. Recent efforts restricting these ideas to on-chip interconnects have allowed architects to make use of abundant wires, smaller clock-synchronization overhead, lower error rates, and lower power dissipation in links compared to off-chip networks, where chip pin counts and power dissipation in the transceivers and links dominate design considerations. At the same time, the connectivity restriction to a planar substrate, aggressive power-dissipation limits, and the overall on-die design constraints dictated by the compute and storage elements and by design teams introduce additional requirements for practical solutions. A few such academic concepts have surfaced in some form or other in industrial implementations, but mostly in the extreme high-end, low-volume part of the market. Otherwise, adoption in the volume market has remained limited to simplistic instantiations of some of the elegant academic concepts. Better penetration of such concepts into actual designs requires more effort on feasible micro-architecture and physical-design implementations, combined with creative validation techniques and tools, to address the needs of industrial design teams that are extremely schedule-driven and risk-averse.
In this chapter, we have captured our experience with such efforts across topology, architecture, routing algorithms, micro-architecture support, FPGA prototyping, performance simulation, and debugging methods and tools. Unfortunately, due to scope and framework limitations, we have left out our physical-implementation evaluations, which are a critical part of the decision-making loop but are a tight function of the specific overall chip design, e.g., diversity of workload, overloading of the network with independent traffic classes, sensitivity to latency, QoS requirements, validation resource constraints, number of agents, chip interconnect availability, power and real-estate constraints, regularity of the design, reuse strategy across market segments, late binding of the design, etc.
Micro-architecture and pipeline design is a critical element of the on-die design. The specific decisions made in the pipeline design are strong functions of the required bandwidth, latency, power and die-space constraints, routing options, validation constraints, reusability in different segments, etc. The pipeline options discussed in this chapter are a subset of a much larger exploration. For example, the latency tolerance of GPU environments provides flexibility in trading off frequency versus latency; their structured and more predictable traffic flow allows some routing-flexibility requirements to be dropped; and their higher bandwidth requirements benefit from richer topologies, which impose a higher number of virtual channels. In short, micro-architecture design and tuning within the vast set of design constraints is a complex task and requires a wide selection of architecture, design, power, and debug tools at the architects' disposal. Unfortunately, the literature is full of apples-to-oranges comparisons of specific ideas and designs, owing to the distinct requirements and constraints of each project. Based on our experience, in many cases the real picture can turn out to be quite different from the suggested findings.
The routing-algorithm section covered the benefits of deterministic and adaptive approaches in realistic environments. Pole routing proved itself a practical solution for addressing complex issues usually confined to specific hot spots, avoiding recourse to extremely complex solutions. Our implementation and debugging of pole routing in a realistic and very large FPGA emulation of a complete system is solid proof of its viability. At the same time, adaptive routing solutions built on Prof. Duato's theoretical basis have great potential but pose serious debugging and validation challenges. Our implementation and optimization of a practical adaptive routing mechanism offered an interesting solution but required major investment in definition, debugging hooks, tools, and simulator and FPGA debugging. Adaptive routing proved able to address extreme hot-spot conditions and to avoid the tuning of deterministic solutions and agent placements that otherwise requires a priori knowledge of diverse workload behaviors. A practical validation methodology is required before committing adaptive routing to a general interconnect design.
Considering the widespread usage of virtualization in the server domain and, to some extent, in the client domain, the much higher level of on-die integration (such as powerful GPUs on the processor die), and the introduction of massive numbers of cores and agents on-die (such as network processors), sharing the interconnection network among distinct independent traffic is critical. The section on performance isolation provides practical trade-offs between network topology flexibility and adaptability versus the feature design complexity. We have been able to implement and emulate this capability in full detail. However, one should pay close attention to the potentially conflicting requirements of this capability against the many other desired routing optimization features; in some cases performance isolation may require disabling other features, e.g., adaptive routing.
Emulation of complex features prior to design commitment is critical for complex networks. However, affordable FPGA emulation requires extensive tools and capabilities to make the effort manageable and worthwhile. The emulation section covered our experience with a 16-processor-core emulation on a mesh/torus interconnect, the porting of Linux, and the actual execution of multi-threaded applications. This effort required many creative solutions, including re-architecting the interrupt interconnect (APIC bus) on a 2D/2.5D network. In practice, one can get away with a much simpler software stack for validation, but a full software stack allows one to evaluate the real interconnect effects under realistic workloads and systems. Emulation speeds are not fast enough to allow long application runs, but they are fast enough for representative workload segments identified by a prior sampling effort.
A fast, rich, and comprehensive cycle-accurate simulator is the essential backbone of many routing, architectural, micro-architectural, and power-analysis explorations. Creative visualization of the simulator results was absolutely critical for our FPGA emulation and design exploration across the board. Effective and creative visualization, together with the interactivity of the tools, allowed us to debug extremely complex adaptive features, sharpen the latency distributions under complex micro-architectural techniques, and gain insight into potential new features. Visualization was also the key to quick optimization of our solutions in the GPU interconnect, through topology modifications, routing optimizations, agent placement, traffic distribution, etc.
Acknowledgements Contributions and insights provided at various points in time by the follow-
ing individuals are gratefully acknowledged: Donglai Dai, Dongkook Park, Andres Mejia, Gaspar
Mora Porta, Roy Saharoy, Jay Jayasimha, Partha Kundu, Mani Ayyar and the late David James.
References
18. Intel Xeon Phi Coprocessor 5110P, Highly parallel processing to power your breakthrough innovations. Weblink, http://www.intel.com/content/www/us/en/processors/xeon/xeonphi-detail.html
19. Y. Tamir, G.L. Frazier, Dynamically-allocated multi-queue buffers for VLSI communication
switches. IEEE Trans. Comput. 41(6), 725–737 (1992)
20. A.S. Vaidya, A. Sivasubramaniam, C.R. Das, LAPSES: a recipe for high performance adaptive
router design, in Proceedings of the 5th International Symposium on High Performance
Computer Architecture (HPCA’99), Orlando (IEEE Computer Society, Washington, DC,
1999), pp. 236–243