Date: 07/11/2008
I, ARUN JANARTHANAN, hereby submit this work as part of the requirements for the degree of DOCTOR OF PHILOSOPHY in COMPUTER ENGINEERING.
It is entitled:
NETWORKS-ON-CHIP BASED HIGH PERFORMANCE COMMUNICATION ARCHITECTURES FOR FPGAS
DOCTOR OF PHILOSOPHY
in the Department of
Electrical and Computer Engineering and Computer Science
of the College of Engineering.
by
Arun Janarthanan
We propose a novel micro-architecture for a hybrid two-layer router that supports both packet-switched communication across its local and directional ports and time-multiplexed circuit-switched communication among the multiple IP cores directly connected to it. Results from placed-and-routed VHDL models of the advanced router architecture show an average improvement of 20.4% in NoC bandwidth (a maximum of 24%) compared to a traditional NoC. We parameterize the hybrid router model over the number of ports, the channel width and the bRAM depth, and develop a library of network components (the MoClib library).
Synthesizing an NoC topology for FPGAs from the above library of network components requires a complex trade-off among switch complexity, available area and bandwidth capacity. We develop an algorithm and an application-generic design flow that includes the required bandwidth and area in its cost function and synthesizes the NoC topology for FPGAs. For a set of real-application and synthetic benchmarks, our approach shows an average reduction of 21.6% in FPGA area (a maximum of 26%) for equivalent bandwidth constraints when compared with a baseline approach.
Interconnecting IP cores with our NoC requires glue logic that can connect different versions of the router to the IPs. To accomplish this, we design a customizable Network Interface that is compatible with our two-layer hybrid router. To capture real core implementation effects, we characterize a library of soft IP cores and implement a typical image compression application on our FPGA. Through experiments, we determine the area and power overhead of our on-chip network on an FPGA when implemented along with a typical application. Further, by accurately modeling our on-chip network for area, delay and power, we develop a platform that can be used to floorplan a complete multi-processor application along with the NoC.
Acknowledgements
Firstly, I would like to express my gratitude towards my advisor, Prof. Tomko, for shaping my graduate studies. Your strong direction and compassion have gone a long way in helping me develop valuable academic and personal qualities. Thank you for helping me meet my deadlines and for reviewing papers on very short notice. I consider myself very fortunate to have taken courses with, and been constantly associated with, Prof. Vemuri and his lab. The high standards you set in the courses and discussions remained a stable platform for my research work. I also thank my other committee members, Prof. Carter, Prof. Jone and Prof. Srinivasan, for reviewing my work and giving valuable feedback.
I consider myself very lucky to have had Prof. Srinivasan as my mentor at every stage of my academic progress, from high school to declaring my dissertation complete. I would like to acknowledge the support extended by Xilinx and Mentor Graphics through their university programs. I would also like to acknowledge our department staff: Rob Montjoy for efficiently handling all computing and licensing issues, and Julie Muenchen for her patience in ensuring that we conform to department regulations and formalities. Thanks to Sundaresan for always finding time to help out with any issue, and to Jayanth for imparting high levels of optimism in every conversation.
In a five-year-long PhD program it is extremely important to stay in touch with a lot of people, and I thank SABHA for serving as a wonderful medium to interact with students and the community. I carry a rich variety of lessons from serving for two consecutive years on SABHA's committee, and I will cherish the experience for years to come.
I am also thankful for UC's relaxed ambience and its support, which covered all of our expenses and left us with enough money for frequent travel. The never-ending road-trip memories that I share with many at UC will be etched in my mind forever. UC's on-campus recreational and housing facilities were excellent as well. I am indebted to my roommates Aravind and Jagadish for making a wonderful home far away from home, for being great cooks, and, more importantly, for agreeing to share cooking turns with me. I know that I have made some lifelong bonds of friendship while I was at UC: Prasanna, Raghav, Ramki, Payal. I fondly remember my paati, who all her life prayed for her grandson's success and moved on to a more peaceful world while I was presenting this doctoral dissertation.
Contents
1 Introduction
2.1.3 Circuit-Switched Interconnects
2.1.4 Packet-Switched Networks
2.3.3 Routing
2.3.4 Arbitration
2.3.5 Buffering
2.4 Current Research in NoC
2.4.1 Industrial Applications with NoCs
3.4.5 Arbiter
3.5 Router Micro-Architecture
3.5.1 Packet Description
3.5.2 Input Port
3.6 Results: Functional Simulation
3.6.1 Common Clock Design
5.2 Motivation
5.2.1 Packetization and Control Overheads
5.3 Related Work
5.4 Architecture Description
5.4.4 Design Parameters
5.5 Architectural Advantages
5.9 Conclusions
7 Experimental Platform
8.4.3 Candidate Topology Selection
8.4.4 Area Optimization
10.1.3 Hybrid 2-Layer Architecture
10.1.4 Performance and Power Analysis
List of Tables
List of Figures
6.5 Dynamic Power Breakdown of an 8-port Hybrid Router
6.6 P-Layer Ports (Switch Size) vs. Dynamic Power (mW)
9.11 NoC Implementation Alternatives
9.12 JPEG Configuration: Area and Power Overhead Analysis
Chapter 1
Introduction
Increasing device sizes coupled with higher operating frequencies enable FPGAs to replace ASICs in several high
performance applications. In this chapter, we discuss platform FPGAs from an SoC per-
spective and present the inherent performance limitation due to scaling of device sizes.
We then introduce the current solutions proposed to handle the performance bottleneck,
with an emphasis on the Network-on-Chip paradigm.
Traditional FPGAs are composed of a large amount of programmable logic and interconnect used to implement user applications. Recently, FPGA companies have adopted a coarse-grained approach that combines the fine-grained reconfigurable resources with hard embedded cores that match their ASIC counterparts in performance, power and area. FPGAs are presently utilized in various application domains, from mobile portable devices to space systems. They increasingly replace ASICs as the target technology of choice due to their growing device sizes and operating frequencies. The following are some of the capabilities that a present platform FPGA supports:
in a single design. Due to the FPGA capabilities listed above and high time-to-market pressures, complex SoC designs are increasingly targeted to FPGAs.
FPGA device manufacturers have achieved large device sizes using 65 nm technology. At this nanometer technology node, interconnects scale poorly compared to transistors. As a result, the global interconnects in a design incur a large delay relative to the transistors. Figure 1.1 presents the difference between local and global interconnect delay due to technology scaling [2].
This performance limitation is more pertinent to FPGAs than to ASICs due to the programmable nature of their interconnects (and their large device sizes). As a result, interconnects in FPGAs account for a significant portion of total circuit delay and power. As designs get bigger, it therefore becomes very difficult to maintain high clock rates. Networks-on-Chip (NoC) is a design paradigm proposed to contend with this inherent performance bottleneck. Figure 1.2 shows the projected trend for on-chip network based design methods [2]. NoCs on FPGAs are an active area of research and hold great promise for meeting present SoC communication needs.
Traditionally, cores in FPGAs are connected using bus-based architectures. NoCs are
proposed as an alternative to eliminate the inherent performance bottleneck in bus-based
architectures. In addition to an increase in performance, NoCs present a host of other
advantages, particularly for FPGA designs. In this section, we discuss these advantages
for implementing multi-core applications with NoCs.
1.3.1 Scalability
In NoCs, the IP cores are connected to the network through routers (the network backbone). As the communication between routers is standardized, the addition of more cores does not impact the rest of the design. Bus-based architectures, on the other hand, scale poorly with the number of cores: with an increasing number of cores and growing arbitration-logic complexity, the operating frequency of bus-based communication degrades.
FPGAs are a suitable implementation platform for portable devices, where low-power operation is a critical requirement. As opposed to bus-based architectures, NoCs consume less power due to lower switched capacitance (shorter wires). Further, due to the high level of parallelism in communication, the overall energy requirement is comparatively low.
Due to their hierarchical approach, NoCs bring a substantial reduction in design and verification time. Further, IP cores can be developed independently of the application and of the other parts of the design. Table 1.1 shows the future trend for design re-use [2]. A steady increase is expected in the percentage of design components re-used, which supports our choice of NoC as a design paradigm.
FPGAs also support partial reconfiguration, which allows a module to be replaced at run time to perform a new task without affecting the execution of other parts of the design. Table 1.1 presents the expected reconfiguration trend [2]: a steady increase is expected in the percentage of reconfigurable components in an SoC. A primary challenge in reconfigurable computing is the concurrent design of the communication and computation subsystems. The NoC approach to design in FPGAs inherently separates these two aspects of the design by providing standard interfaces to the cores. Several current research efforts [4] [5] advocate NoC design on FPGAs for efficient module replacement and design re-use.
The goal of this research is to develop Networks-on-Chip based high performance communication architectures for FPGAs. Once completely developed, this NoC framework can efficiently replace the traditional communication architecture.
1.4.1 MoCReS
An FPGA based on-chip network has a unique set of design goals, including satisfying the bandwidth requirements with a minimum of the limited resources available. We develop MoCReS, a light-weight multi-clock router for FPGAs; its 5-port version consumes only 282 Virtex-4 slices (a marginal 0.57% of an XC4VLX100) and operates at 357 MHz, supporting a competitive data rate of 2.85 Gbit/s.
The NoC communication architecture competes for resources with the user application and also incurs a power overhead. We determine the power consumption of the NoC framework on the FPGA. Further, we analyze the power trade-offs associated with our design novelties by comparing our design with a baseline approach implemented on the same target device.
A strictly packet-switched NoC sustains a high serialization overhead. To offset this performance overhead, we develop a novel hybrid two-layer router architecture that supports packet switching for inter-router transfers and time-multiplexed circuit switching for IP cores connected to the same router. The advanced router architecture achieves an average improvement of 20.4% in NoC bandwidth (a maximum of 24%) compared to a traditional NoC.
The traditional design flow for FPGAs provides sophisticated CAD tools to achieve design closure. However, multi-core designs do not have a standardized CAD flow for FPGAs. Moreover, the vast heterogeneity of FPGA devices further complicates the design flow for NoC based FPGA designs. Further, CAD solutions for multi-core applications cannot be borrowed from the ASIC domain, due to the inherent differences in the underlying architecture. As part of this research, we design an algorithm to effectively automate the NoC design cycle for FPGAs. For any given application, described as a task graph, our integrated synthesis framework determines a suitable NoC topology that satisfies the bandwidth requirements while optimizing the area overhead.
Implementing a complete SoC development flow using FPGA based NoCs also requires a customizable Network Interface and a characterized library of IP cores, which we develop as part of this research.
The relevant NoC and FPGA background material required for this research is presented
in Chapter 2. Further, this chapter includes a survey of alternate approaches considered
in current research.
Chapter 3 describes the FPGA based MoCReS framework that is developed as a part
of this research. The chapter also presents the area/performance trade-offs for various
versions of the router. We utilize the NoC framework described in this chapter to conduct
all the experiments presented in the following chapters.
We analyze the power dissipated in our FPGA based NoC framework and present the analysis and results in Chapter 4. Further, a comparison of our MoCReS framework with a
baseline NoC design in terms of power is featured in this chapter.
Chapter 5 presents a hybrid two-layer router architecture. The advantages behind the
novel architecture, along with the design issues involved are presented in this chapter.
The area and performance metrics of the hybrid router architecture are characterized and
stored along with parameterized designs in a MoClib NoC component library.
Chapter 6 presents a detailed analysis of performance and power of our novel NoC frame-
work. Through detailed traffic analysis and comparisons, we present the performance
gain obtained in our hybrid router framework. This chapter concludes with a component
based NoC power analysis and a comparison of switch and link power in FPGA based
NoCs.
The experimental platform is outlined in Chapter 7. It includes a description of the
experimental platform for this thesis, including the CAD tools, software and hardware
used. We also present the application and synthetic benchmarks that we utilized to
extract the area/performance results in the experiments.
Chapter 8 formally presents the design and implementation of the CAD tool developed
to perform automatic topology synthesis. It presents the cost constraints and trade-offs
involved in the algorithm during design space exploration. The chapter also includes the
results obtained for a variety of application and synthetic benchmarks.
In Chapter 9, we present the IP implementation methodologies and characterize a library
of frequently used IP cores in FPGAs. The design goals behind a network interface
particularly suitable for our NoC are also presented. Furthermore, the chapter includes a
power-performance study into floorplanning our on-chip network in FPGAs.
Finally, Chapter 10 summarizes all the contributions made in this dissertation, and
outlines the future research directions.
Chapter 2
Although we give particular emphasis to FPGA based on-chip networks, much of the background material presented
in this chapter is also applicable to ASIC networks. In this chapter, we first present
the alternatives in communication architecture, followed by a detailed description of a
Network-on-Chip. We conclude the chapter by discussing recent research in the NoC
area.
The choice of interconnection mechanism involves trade-offs among several metrics, including:
• Throughput Available
• Signal Integrity
• Bandwidth Guarantee
Based on the above trade-offs, the interconnection mechanisms can be broadly classified into dedicated interconnects, time-multiplexed circuit-switched interconnects and packet-switched networks.
Present FPGA devices support dedicated interconnects. These are the spatially dis-
tributed FPGA resources configured through programmable switches. The latency of
this type of communication is very low and there is guaranteed bandwidth to support
the communications. However, the interconnect utilization is extremely low, as the dedi-
cated connections are almost never time-multiplexed for a different communication. With
the limited resources available within the FPGA and increasing design complexity, it is challenging to preserve signal integrity with this form of interconnect. The present Virtex [1] device architecture supports this type of interconnect.
Time-multiplexed interconnects offer high-throughput connections with very high interconnect utilization. This approach requires all the communication schedules to be known offline. With an increasing number of core communications, the area required for context memory offsets the gains achieved in throughput and interconnect utilization. Supporting larger designs is also a challenge, as these circuit-switched connections, once established, follow a synchronous scheme. PNoC [7] advocates this type of interconnect for FPGAs.
The central idea of this structure is to transport data across modules in the form of packets. Multiple packets from several source-destination pairs are in flight simultaneously, thereby increasing the overall performance. In contrast to circuit-switched interconnects, this form of communication is established online and does not require static scheduling.
The main modules present in an on-chip network are the IP cores (computation units), the routers (the network backbone) and the Network Interfaces (NI). On the receiving side of the network, the NI grants the request of the downstream router to receive the packets. The NI is customized to suit the requirements of a particular network backbone. Research is underway to standardize these network interfaces and IP communication protocols [9].
Network-on-Chip Component: The router forms the heart of the NoC backbone. It is responsible for transporting packets that originate from the IP cores. Traditionally,
a router in a mesh network has four directional ports (North, East, South and West)
to communicate with the neighboring routers. Further, it has at least one local port
through which an IP core is interfaced to the network. Upon receiving a packet, the
router decodes, buffers and routes it in the appropriate direction based on the destination
node. Throughout this chapter we use the terms router and switch interchangeably.
An on-chip interconnection network can be described by a set of design choices called
network aspects. We describe network aspects in the following section.
The choice of suitable network aspects has a large impact on the design metrics of the NoC.
Topology, Flow Control, Arbitration, Buffering and Routing are the main aspects of a
network.
2.3.1 Topology
2.3.2 Flow Control
The three main alternatives for flow control in an NoC are Store-and-Forward, Virtual Cut-Through and Wormhole. In the Store-and-Forward technique, the outgoing packet is completely buffered at the downstream router before making the next hop. This technique, used in [11], sustains a high packet latency (directly proportional to the packet size). In contrast, the Virtual Cut-Through technique has low latency, as the head flit progresses without waiting for the rest of the packet. However, the buffer requirements are multiples of the packet size for both of the above approaches. The Wormhole technique, which operates on flits, sustains low latency with minimum buffer requirements, but with a packet residing across multiple nodes, thereby increasing the complexity of the switch.
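To first order (a standard comparison, not an equation taken from this chapter), the zero-contention latency of the three schemes can be contrasted as follows, where H is the hop count, t_r the per-hop router delay, L the packet length and b the channel bandwidth:

    T_SF ≈ H × (t_r + L/b)        T_VCT ≈ T_WH ≈ H × t_r + L/b

Store-and-Forward pays the full serialization delay at every hop, whereas Virtual Cut-Through and Wormhole pay it essentially once; the latter two differ mainly in their buffer requirements and blocking behaviour.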
2.3.3 Routing
The routing mechanism determines the decisions taken while a packet is in flight. The route for a packet-based communication can be determined either at the source node (source routing) or independently across the routers of the network (distributed routing). In the case of source routing, the first flit of a packet is the header, which contains the entire route of the packet. Upon decoding the header, the router passes the packet to the appropriate downstream router. In the case of distributed routing, the route decisions can be made throughout the network by the routers (depending on congestion and other network conditions).
Another classification of routing mechanisms is based on adaptivity to network conditions. In deterministic routing, the route between any (source, destination) pair is always the same. XY routing is a widely used, low-complexity deterministic routing mechanism that is deadlock free. On the flip side, a deterministic routing mechanism cannot alter routes based on network traffic conditions. Paths taken using adaptive routing can vary and may be non-minimal; however, the switch incurs additional complexity to support this mechanism.
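As an illustration of the deterministic XY decision (a behavioural C++ sketch with an assumed coordinate convention, not the routing logic of any particular router in this work):

    #include <cstdint>

    // Output port selected for an incoming packet.
    enum class Port { Local, North, East, South, West };

    // XY routing: travel along the X dimension until the destination column is
    // reached, then along Y.  (cur_x, cur_y) is the current router, (dst_x, dst_y)
    // the destination carried in the packet header.  North is taken as +Y here,
    // which is an assumption of this sketch.
    Port xy_route(uint8_t cur_x, uint8_t cur_y, uint8_t dst_x, uint8_t dst_y) {
        if (dst_x > cur_x) return Port::East;
        if (dst_x < cur_x) return Port::West;
        if (dst_y > cur_y) return Port::North;
        if (dst_y < cur_y) return Port::South;
        return Port::Local;   // arrived: deliver to the attached IP core
    }

Because the decision depends only on the current and destination coordinates, the route for a given (source, destination) pair never changes, which is what makes the scheme deterministic and simple enough for a light-weight header decoder.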
2.3.4 Arbitration
Within a router, arbitration is required when servicing conflicting requests. When multiple input ports request a common output port, the requested output port determines the order in which the requests are acknowledged. The scheduling can be either static (predetermined) or dynamic. Round-robin arbitration is a popular technique that provides a fair dynamic granting scheme. Additionally, Quality-of-Service (QoS) guarantees can be provided through priority-based arbitration schemes.
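A behavioural sketch of such a round-robin grant scheme (illustrative C++, not the arbiter implementation of any specific router):

    #include <array>
    #include <optional>

    // Round-robin arbiter over N requesters: the port granted last is given the
    // lowest priority in the next arbitration round.
    template <int N>
    class RoundRobinArbiter {
        int last_granted_ = N - 1;   // so that port 0 is favoured initially
    public:
        // requests[i] is true when input port i wants the shared output port.
        // Returns the granted index, or std::nullopt if nobody requested.
        std::optional<int> grant(const std::array<bool, N>& requests) {
            for (int offset = 1; offset <= N; ++offset) {
                int candidate = (last_granted_ + offset) % N;
                if (requests[candidate]) {
                    last_granted_ = candidate;
                    return candidate;
                }
            }
            return std::nullopt;
        }
    };

Rotating the priority after every grant guarantees that a continuously requesting port is served within N rounds, which is the fairness property referred to above.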
2.3.5 Buffering
At intermediate nodes, packets need to be temporarily stored while waiting for channel access. Based on where the buffers are placed in the routers, input buffering and output buffering are the two main classifications. While a packet is buffered at the input of a router, additional requests from upstream routers are not granted; this situation is called Head-of-Line (HOL) blocking. With output buffering, HOL blocking is avoided, thereby decreasing the average latency of the packets.
In this section, we summarize the recent research work in the area of NoCs. A brief
account of current NoC applications in industry is also presented.
The International Technology Roadmap for Semiconductors 2005 [2] highlights the inherent performance bottleneck due to poor interconnect scaling. The concept of routing packets instead of wires to gain performance was introduced by Dally et al. [12] and was later formally presented as a solution paradigm by Benini et al. [13].
The first proof-of-concept was presented by Kumar et al. [14], who implemented a complete NoC framework. Since then, there have been several academic and industrial advancements.
The NoC framework [15] by Philips is one of the first reported industrial implementations of an NoC. Sony has partnered with IBM [16] for the network-on-chip implementation of the multi-core design in the PlayStation 3. Further, Sonics [17] has been offering commercial on-chip interconnect solutions.
Marescaux et al. [4] have applied the NoC paradigm on a Virtex device to enable multi-
tasking by tile-based reconfiguration. The Hermes [20] NoC platform was developed par-
ticularly for FPGAs to enable dynamic reconfiguration. Until recently the potential of
NoCs to address the performance issues in FPGAs was left unexplored. Bartic et al. [21]
present an adaptive NoC design for FPGAs and analyze its implementation issues. Sal-
dana et al. [10] address multi-processor designs in FPGAs by considering various NoC
topologies while scaling the number of IP cores. Hilton et al. [7] design a flexible circuit-
switched NoC for FPGAs. Kapre et al. [22] compare the suitability of packet and circuit
switching for FPGAs.
In addition to the current research presented in this chapter, we also refer to related
work alongside the contributions presented in the subsequent chapters.
Chapter 3
3.1 Introduction
The key idea is to implement a low area and high performance packet-switched NoC
framework for FPGAs. The central component of the NoC (router) can support inde-
pendent operating frequencies, dictated by placement and routing constraints in FPGA.
Moreover, the router supports a low latency virtual cut-through flow control for vari-
able packet sizes. Our 5-port router has an area overhead of only 282 Virtex-4 slices (a
marginal 0.57% of logic resources of an XC4VLX100 device) and can operate as high
as 357 MHz supporting a competitive data rate of 2.85 Gbit/s. We gain in router area
and performance by reducing the logic depth of the central arbiter and cross point ma-
trix. We utilize our router to construct a mesh based multi-clock on-FPGA NoC. We also
demonstrate its functionality and characterize performance, area and power of several
versions of the router.
Our router is designed for minimum area overhead, maximum operating frequency and low latency of operation.
In this section, we compare our router to other proposed FPGA based NoC routers [11]
[23] [24]. In [11], the authors present a light-weight FPGA based parallel router that
uses store and forward flow control. This router has the disadvantage of high latency
(directly proportional to packet size) and it supports only fixed packet sizes. The modified
header in our virtual cut-through router overcomes the above mentioned disadvantages
by encoding the packet size as a fraction of the required FIFO depth. This technique
ensures low latency of operation and improved buffer utilization as a result of supporting
variable packet sizes. Research in [23] [24] [4] [25] presents wormhole based routers using
XY deterministic routing. Though the wormhole routing limits the buffer requirements,
it increases the area consumed due to its complexity thus limiting the logic available for
IP implementation in FPGAs. The 5 port wormhole router presented in [23] consumes
1832 Virtex II slices and operates at 66 MHz. Such an increase in router area degrades its
performance and increases the power consumed. Further, [24] [4] support variable packet
sizes, but with an additional header flit overhead as compared to our router. We will
show that our router consumes fewer resources than the above designs when implemented
in a similar target device. Table 3.1 presents a comparison of MoCReS with alternate
designs in terms of area (in slices) and operating frequency (MHz). The number of slices
utilized is a standard metric to compare FPGA area. In terms of logic, a Virtex-II slice is
equivalent to a Virtex-4 slice. The high operating frequency obtained through MoCReS is
also attributed to the low interconnect delays in Virtex-4 (advanced) architectures. The
operating frequency of our router in a comparable Virtex-II pro device was 172 MHz.
The slowest router implemented in an FPGA NoC determines the overall network operating frequency and degrades its performance [25] [10]. We overcome this limitation by enabling the router to
support a multi-clock framework. Kim et al. [26] propose to interface the local cores op-
erating on individual frequencies with the network using asynchronous FIFOs for ASICs.
If implemented on FPGA, the slowest router would still dictate the operating frequency
of the network. Moreover, [26] buffers the entire packet from the local core before for-
warding it to the network. On the other hand, our router follows a modified multi-clock
virtual cut-through approach hence sustaining low latency.
The key design objective is to minimize the area consumed by the router, which is the
central component of a network. Reducing the logic ensures sufficient resources for SoC
design in FPGA and also minimizes power overhead. Secondly, we target to increase
the operating frequency of the router keeping the network latency to a minimum. It
is essential to improve the network bandwidth and avoid the bottleneck present in bus
based architectures. The final objective is to operate multiple routers on independent
clock frequencies thereby preventing the slowest router from restricting the operating
frequency of the network. Figure 3.1 presents a multi-clock framework with routers
functioning at individual frequencies. Dual-ported input buffers are used to cross clock domains.
[Figure 3.1: Multi-clock NoC framework: a 2x2 mesh of routers R(0,0) to R(1,1) with local cores L0 to L3; each router (input ports, central arbiter and crosspoint) operates on its own clock, CLK_R0 to CLK_R3.]
The network topology, along with the flow control, routing, buffering and arbitration mechanisms chosen for our router, is described in this section.
3.4.1 Topology
We choose a mesh topology for our light-weight network. Mesh networks have a minimum
area overhead [10] (reduced number of nets) and low power consumption. In addition,
area scales linearly with the number of nodes and channel width in a mesh. A mesh also
maps well to the underlying routing structure of FPGA. Hence, choosing mesh networks
reduces the congestion in FPGA logic and routing which minimizes power consumption.
3.4.2 Flow Control
Virtual Cut-Through and Wormhole techniques (unlike Store-and-Forward) have a packet latency that is proportional only to the path length. However, the complexity of a wormhole router, compared to a virtual cut-through router, makes it less suitable for a light-weight implementation. We have therefore chosen a virtual cut-through flow control mechanism for our router. This scheme supports higher throughput than wormhole routing by efficiently releasing the upstream buffers during blockages. Furthermore, virtual cut-through flow control supports high channel utilization with low latency and does not reserve physical channels.
3.4.3 Routing
We choose the deadlock free XY routing for our switch. The simplicity of the XY
routing adds little overhead to the header decoding logic. Hence, XY routing is suitable
for implementing our area efficient router on FPGA.
3.4.4 Buffering
We buffer incoming packets only at the input ports. Although the input buffering intro-
duces the head-of-line problem, it leads to a low area overhead. In addition to buffering
the incoming flits, the input buffers also provide a framework to implement a multi-clock
network.
3.4.5 Arbiter
To ensure fairness, the competing input ports are allocated based on a simple round
robin approach. The priority of the last served/denied port is placed at the end of the
queue. The FIFO virtual channels also follow a round robin approach when switching
packets downstream.
The MoCReS router consists of five input ports, a crosspoint matrix and a central arbiter. Except for the header decoding logic, the five input ports are identical. As we adopt a flow control with virtual channels (VCs), each input port contains arbitration logic for multiple VCs. The input port also contains input buffers to store the incoming packet. The MoCReS architecture was developed in collaboration with another colleague [27]; a more detailed description of the architecture can be found in that thesis [27].
For a FIFO depth of 16 (utilizing the Xilinx block RAM), the packet size can vary
between 24 bits and 128 bits with a header overhead of only 1 flit per packet. The flit
size is fixed at 8 bits. The header contains the address of the destination router, flit type
and packet fraction. Our virtual cut-through flow control needs one bit to specify the
flit type. The tail bit is set on the flit prior to the last flit to send the terminate signal
without wasting a clock cycle.
The router supports variable packet sizes by encoding the packet size as a fraction
of the required blockRAM (bRAM) depth (packet fraction) in its header. The Network
Interface (NI) in each IP core takes on the responsibility of storing the fraction in the packet's header. The fraction bits are complemented before storing, to enable efficient comparison against the write count of the FIFOs. The number of fraction bits and the flit width can be increased
if a higher packet granularity is desired. Increasing the number of fraction bits also
improves the buffer utilization with a marginal area overhead. The remaining bits in the
header are reserved to implement priority based flow control. An advanced version of the
router could utilize the remaining header bits to incorporate Quality-of-Service (QoS)
and offset the impact of contention latency.
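The exact bit allocation of the header is not spelled out here, so the field widths in the following sketch are assumptions made purely for illustration (only the 8-bit flit width is fixed by the text); it shows how a header flit could pack the destination address, the flit-type bit and the complemented packet fraction:

    #include <cstdint>

    // Hypothetical 8-bit header flit layout (field widths are assumptions):
    //   [7:6] destination X      [5:4] destination Y
    //   [3]   flit type (1 = header)
    //   [2:1] packet fraction, stored complemented so that it can be compared
    //         directly against a FIFO write count at the receiving port
    //   [0]   reserved (e.g. for priority / QoS extensions)
    uint8_t make_header(uint8_t dst_x, uint8_t dst_y, uint8_t fraction) {
        uint8_t complemented = static_cast<uint8_t>(~fraction) & 0x3;  // 2-bit fraction
        return static_cast<uint8_t>(((dst_x & 0x3) << 6) |
                                    ((dst_y & 0x3) << 4) |
                                    (1u << 3) |            // header flit type
                                    (complemented << 1));
    }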
[Figure 3.3: Router Micro-Architecture (Input Port, Cross-Point, Central Arbiter): dual bRAM FIFOs with write counters and empty flags, VC arbiter with input de-multiplexer, header decoder, central arbiter with request/grant lines to the L, N, E, S and W ports, and the 4:1/2:1 cross-point multiplexers.]
The router has a set of input ports, namely, Local (L), North(N), East(E), South(S) and
West(W) to communicate with the local core and neighboring routers. Each input port
can support multiplexed virtual channels, associated arbiters, and header decoding logic to make routing decisions. The three main components of the input port (Figure 3.3) are the Virtual Channel Selector, the FIFO bRAMs and the bRAM Arbiter.
A. Virtual Channel Selector (VC Selector): The decision on the availability of space in the input buffers is made by the VC selector. Along with the header, it receives the size of the packet in coded form. Upon receiving this data from the upstream router or core, the VC selector compares the size of the packet with the space available in the least occupied of the virtual channels, by comparing the incoming size against the FIFO's write count. If adequate space exists, the VC selector acknowledges the request back upstream. In the process, the VC selector also sets the input de-multiplexer, which routes the packet from the input channel to the selected FIFO.
B. FIFO bRAMs: The dual-ported bRAM FIFOs serve three purposes:
1. Buffer the incoming packet partially or fully and when the downstream switch is
available, forward the head and subsequent flits.
2. Demarcate the router to core and router to router frequencies, hence supporting a
multi-clock network design.
3. Support variable packet sizes by enabling the arbiter to monitor the write count
information.
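A behavioural sketch of the admission check that ties the VC selector to the FIFO write counts just described (illustrative C++; the 16-deep FIFO comes from the text, while the two-VC configuration and all names are assumptions):

    #include <array>
    #include <optional>

    constexpr int kFifoDepth = 16;   // minimum bRAM FIFO depth on the target FPGA

    struct VirtualChannel {
        int write_count = 0;         // flits currently buffered in this FIFO
    };

    // packet_flits is the packet size recovered from the (complemented) fraction
    // bits of the header.  The selector picks the least-occupied VC and
    // acknowledges the upstream request only if the whole packet fits, steering
    // the input de-multiplexer to that VC in the same step.
    std::optional<int> select_vc(const std::array<VirtualChannel, 2>& vcs,
                                 int packet_flits) {
        int best = (vcs[0].write_count <= vcs[1].write_count) ? 0 : 1;
        int free_slots = kFifoDepth - vcs[best].write_count;
        if (packet_flits <= free_slots) return best;   // ack sent back upstream
        return std::nullopt;                           // no space: request denied
    }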
Our router hence does not restrict the size of the packet, a restriction which could otherwise lead to inefficient data transfer. The variable packet size capability comes at a marginal (less than 5%) increase in switch area. This overhead is acceptable, considering the inefficiency in performance and power caused by padding smaller packets with empty flits.
An efficient implementation of the full/empty logic is required when buffers are used to separate clock domains. We synchronize the control signals by a) sending the granularity of the packet as a fraction of the FIFO size, and b) setting the tail bit on the flit before the last one to terminate the connection without wasting a clock cycle.
Since the minimum supported FIFO depth in our target FPGA is 16, it is appropriate to buffer the entire packet during contention (virtual cut-through). The write count, empty and full status signals are integrated into the FIFO with minimal additional logic. The capability to store flits from multiple packets in one buffer improves the FIFO utilization. We also gain in network throughput, as the successive flits release the upstream buffer as soon as the head advances.
C. bRAM Arbiter: The input port also contains the control logic to make arbitra-
tion decisions. A simple round-robin approach is followed when choosing a non-empty
bRAM. Upon choosing a bRAM, the FSM pops the head flit, decodes its destination
and sends appropriate requests. XY routing is adopted in our router which simplifies the
decoder logic significantly. The number of outgoing request lines is reduced according
to the connections that XY routing permits.
Header Decoder: In XY routing, the head flit travels in the X direction and once it
reaches the destination X, it travels in the Y direction. All subsequent flits follow the
header flit in a pipelined fashion. Due to packets traveling in the X direction completely
before Y, a request is never sent from the North and South ports to the downstream
East and West ports. This nature of XY routing is used to reduce the amount of logic
in a) the Header Decoder b) the Cross Point Matrix and c) the Central Arbiter. The
above simplification of the logic translates into significant FPGA slice reduction.
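One way to see the logic reduction is to tabulate the output requests that XY routing permits from each input port (an illustrative C++ encoding, not the actual decoder or arbiter implementation):

    #include <cstdint>

    enum Port : int { L = 0, N, E, S, W };

    constexpr uint8_t bit(Port p) { return static_cast<uint8_t>(1u << p); }

    // Under XY routing a packet never turns from the Y dimension back into X,
    // so the North and South input ports never need request lines (or crosspoint
    // paths) towards the East and West outputs.
    constexpr uint8_t kAllowedRequests[5] = {
        /* from L */ bit(N) | bit(E) | bit(S) | bit(W),  // local may go anywhere
        /* from N */ bit(L) | bit(S),                    // no turn into E/W
        /* from E */ bit(L) | bit(N) | bit(S) | bit(W),
        /* from S */ bit(L) | bit(N),                    // no turn into E/W
        /* from W */ bit(L) | bit(N) | bit(S) | bit(E),
    };

Reading the table column-wise also explains the multiplexer sizing mentioned below: the E and W outputs can only ever be requested by two sources each (Local plus the opposite direction), while the L, N and S outputs need four-input multiplexers.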
We design a multiplexer based cross point matrix to minimize the area. An alternative
would be to support cross-point connections for each de-multiplexed virtual channel. The latter approach, while producing higher network throughput, adds significant complexity to the switch. The cross point supports parallel connections between exclusive input and
output ports and is used by the central arbiter to support simultaneous requests. Not
all cross point connections are utilized by the XY routing. After optimizing the logic, we
implement the switch with simple 4 and 2-input multiplexers (for L, N, S and E,W ports
respectively). The above optimization reduces the cross-point area to only 32 slices.
To ensure fairness, the competing input ports are allocated based on a simple round-robin approach: the last served/denied port is given the lowest priority and placed at the end of the queue. We gain in router performance by a) reducing the logic in the central arbiter, which we determined lies on the critical path of our router, and b) centralizing the arbiter, which reduces the number of request/grant signals required. Reducing the logic and routing also minimizes congestion in the design, which translates into gains in performance and power.
We minimize the logic in the critical path by reducing the number of service states
in the central arbiter for the requesting downstream input. The East input port will
be requested only by the West and Local ports and similarly, West input port only
by East and Local ports. Also, it is sufficient if only one grant reaches the requesting
input port for all its requests. This reduces the number of nets to be routed hence
minimizing the congestion. We achieve an operating frequency of 357 MHz without
queuing the simultaneous requests for downstream input ports, i.e., we enable the arbiter
to handle multiple requests simultaneously. Upon granting an input port, the central
arbiter configures the multiplexers in the cross point matrix to establish a connection.
In this section, we present the functional simulation results of our design on an XC4VLX100-11 device [1], on a Nallatech BenDATA [3] development board. We use Xilinx ISE 8.2i to synthesize, place and route our design. We validate the functionality of our design using ModelSim 6.1c [28] and present the results below. We functionally simulate two versions of MoCReS: a common clock (synchronous) version and a multi-clock version.
[Figure 3.4: Functional simulation waveforms. (a) Common clock operation: flits arriving on the five input ports (L, N, E, S, W) are switched simultaneously to the appropriate output ports. (b) Multi-clock operation: each input port receives data on its own clock and the outputs are synchronized to the read clock.]
The router follows virtual cut-through flow control, based on a simple request/acknowledge
protocol. Figure 3.4(a) presents the simulation results of our standalone router operating
on a common clock. Our central arbiter and cross point are capable of establishing par-
allel input port connections without clock penalty. It can be seen that the flits coming in
through the five input ports are simultaneously switched in the appropriate directions.
Figure 3.4(b) shows the operation of our router when each input port receives data
at different frequencies. Once the empty signal of a FIFO is pulled low, the bRAM
arbiter decodes the header at the router’s read frequency. It can be seen that outgoing
packets are synchronized with the read clock and follow an order similar to their incoming
frequencies.
We choose area and performance as the two design metrics to be characterized for our
MoCReS design. A brief description of the experimental platform constructed for the
area-performance study is also presented below along with the results obtained.
[Figure 3.5: Router area (slices) versus channel width (bits) for the Basic (1VC+CC), 1VC+MC, 2VC+CC and 2VC+MC versions of the router.]
We synthesize, place and route the structural VHDL model of our router and present an
analysis of its FPGA resource utilization in this section. Upon tightly constraining the
area using Xilinx PACE tool [1], the basic version (1 Virtual channel + Common Clock)
of our router consumes 282 Virtex-4 slices (558 LUTs, 289 slice FFs), which corresponds to a marginal 0.57% of our target FPGA device (XC4VLX100). Table 3.2 presents the
results from synthesis of the basic version of our router, identified as 1VC + CC.
To characterize our common and multi-clock (MC) router for area, we develop three
more versions of it by varying the number of virtual channels (VC). Figure 3.5 presents
the scaling of area versus channel width for various versions of the router. An increase
in the channel width causes a significant increase in the area of the router, due to scaling
of the cross point matrix. This increase in router area could be significant if the cross
point occupies a larger area, as it does in most designs.
[Figure 3.6: Logic overhead (LUTs) and routing overhead (nets) versus channel width (bits) for a mesh network.]
However, this disadvantage is reduced in our router design, due to the area optimizations applied in the
cross point matrix. From Figure 3.5 we observe that even for an 8× increase in channel
width, the router area increases at the most by 2×.
The complete mesh network consumes only 6.1% of the available FPGA device area, leaving the remaining logic to efficiently implement the IPs. Figure 3.6 presents the scaling of logic and routing utilization with channel width in a mesh network. The linear scaling of the routing resources utilized with increasing channel width demonstrates the suitability of the mesh topology for FPGAs.
We make use of Xilinx PACE [1] to tightly constrain the critical path and estimate the
post place and route delay. As the routers will be internally connected to the cores,
we need not consider the pad to pad delays while estimating the maximum frequency.
The design is implemented with a Router Functional Module (RFM) [29] wrapper to
estimate the accurate operating frequency. The standalone version of our basic router
can operate at 357 MHz. Therefore, our 8 bits/channel router has a maximum throughput
of 2.85 Gbits/s.
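The quoted data rate follows directly from the operating frequency and the 8-bit channel (one flit switched per cycle):

    357 MHz × 8 bits/cycle = 2856 Mbit/s ≈ 2.85 Gbit/s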
[Figure 3.7: Average packet latency (cycles) versus flit injection rate for the common clock and multi-clock networks.]
In our router, the head flit advances to the next node while the remaining flits flow
in a pipelined fashion. In this scheme the network latency depends only on path length
H (number of hops). In the absence of channel contention, the latency of our network L
can be expressed as:
L = 7 × H + B/w (3.1)
where, B is the number of bytes in the packet and w is the number of bytes switched
per clock cycle. The factor 7 in the expression is the setup latency incurred at every
router hop in the MoCReS router. This latency is due to the decoding & arbitration
performed based on the header flit in a packet. If Li denotes the latency of the ith packet
in cycles, then the average latency of the common and multiple clock networks can be
expressed as,
Lcc_avg = ( Σ_{i=1}^{N} L_i ) / N        and        Lmc_avg = ( Σ_{i=1}^{N} Σ_{j=1}^{H_i} f_j / f_worst ) / N        (3.2)

where f_j and f_worst represent the frequencies of the j-th router and the slowest router, respectively, and Lmc_avg is given in cycles of the frequency f_worst.
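As a quick illustration of Equation (3.1), using made-up numbers: a 16-byte packet crossing H = 3 hops on the 8-bit channel (w = 1 byte switched per cycle) incurs, in the absence of contention,

    L = 7 × 3 + 16/1 = 37 cycles.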
Multi-Clock Experimental Platform: In order to evaluate the performance of the
proposed multi-clock framework, we utilize our VHDL model of the router and simulate a
3×3 mesh. We implement a wrapper to generate packets at the local core frequency. Router
frequency values are extracted after the topology synthesis, placement and routing stages.
For varying configurations (resource availability, inter-router distance, bRAM/dRAM
FIFO versions), the router frequency can degrade up to 18% [29]. Figure 3.7 shows the
latency versus injection rate curve for the common clock and multi-clock versions. For
the common clock case, the network frequency was 286 MHz and for the multiple clock
case, the frequency ranged from 357 MHz to 286 MHz. The X-axis represents the injection
rate expressed as number of flits injected from every node in one cycle. The Y-axis plots
the measured average latency of packets in each case. It can be seen that the increase in
performance of the proposed framework significantly delays network saturation.
At each hop, the router:
• arbitrates the bRAMs to decide which packet progresses forward in that cycle
• decodes the header flit and sends appropriate request to the central arbiter
The setup latency of our router (to accomplish the above sequence of operations) as
shown in equation 3.1 is 7 cycles. This latency is a constant for all communications in
the network, i.e., for every router hop, a latency of 7 cycles is incurred.
However, by utilizing the First Word Fall Through (FWFT) capability available in Xilinx FIFOs, this latency can be reduced by one cycle. Here, the head flit, when pushed into the FIFO, appears at the output bus in the same clock cycle, thereby feeding itself as input to the bRAM arbiter to initiate the request process to the Central Arbiter. The network latency then becomes:
L = 6 × H + B/w (3.3)
3.8 Conclusions
We present an area-efficient, multi-clock, on-FPGA virtual cut-through router that delivers high performance. We use this framework as the baseline design for the experiments presented in the subsequent chapters.
The MoCReS framework for FPGA based NoC design is an important contribution of
this thesis. It has low area and high clock rate in comparison to other proposed FPGA
NoCs. However, the performance improvement derived by this NoC is limited by its
strict packet-switched nature. There is a significant performance overhead involved in
converting data to flits, encoding the headers, and serializing them into packets. This
overhead is particularly prominent for IP cores placed close to each other. Overcoming this overhead is our motivation for designing the novel architecture presented in Chapter 5, which supports an additional time-multiplexed circuit-switched layer for IP cores connected to the same router.
Chapter 4
4.1 Introduction
With increasing device sizes and capabilities, power dissipation in FPGAs has become a
primary concern. Further, FPGA implementations for portable hand-held devices demand low power consumption to extend battery life. Therefore, it is essential to attain a
balance between power dissipation and performance in an FPGA based NoC. In this
chapter, we analyze power consumption in our FPGA based implementation of an NoC.
The power dissipated across various components of the NoC is presented. Further, we
discuss the power consumption from the FPGA resource utilization perspective. Once
fully developed, the model could be used to enhance the traditional FPGA design flow
for multiprocessor applications.
Further, we feature a comparison of MoCReS with an alternate approach (LiPaR [11])
in this chapter. The alternate router does not support independent clock frequencies,
and thereby sustains a performance bottleneck. Experimental evidence indicates that the
multi-clock novelty has a marginal power overhead due to the additional clock resource
utilized in the FPGA implementation. Moreover, to understand the impact of power
optimization techniques on the NoC, we determine the share of the total power consumed by the NoC when a typical image compression application is implemented on our target device, and report the results in this chapter. It is also shown experimentally how the power consumed by NoCs in FPGAs compares with that reported for ASICs.
Vestias et al. [32] propose an approach to explore the design space of an SoC imple-
mented with NoC backbone. The authors validate their technique by mapping a JPEG
encoder application and optimizing the design for performance. However, the authors
do not characterize the power consumption in their design. The main drawback of the
router that the authors implemented in [32] is that it utilizes a store-and-forward flow
control mechanism. The latency of a packet is very high, as it is directly proportional to
the packet size. Moreover, this implementation is not suitable for power efficient NoCs
because of its high buffer requirements. The buffers, in addition to increasing the area, also increase the power consumed. In contrast, we determine the NoC's share of the total power consumption using a typical image processing application and report the results in this chapter.
In this section, we analyze the power consumed in the MoCReS router architecture.
The central component of an NoC is the router, and its power consumption has a significant impact on the total power consumed by the NoC. The two main components of FPGA power consumption are the static power (referred to as quiescent power) and the dynamic power. Even though current FPGAs have moved to 65 nm technology, the dynamic power tends to dominate the total power in FPGAs, due to high operating frequencies (toggle rates). This trend is contrary to ASICs, where, at current technology nodes, the leakage power dominates. Therefore, it is important for an estimation methodology to account for both components of power.
The amount of quiescent power is largely dependent on the target device (technology and preset operating voltages) and the amount of logic utilized by the design. The dynamic component, on the other hand, depends on:
• Switched Capacitance
• Transition Activity
• Operating Voltage
Switched capacitance varies with the type of logic/routing resource utilized by the design. The transition activity of every signal in the design can be expressed in terms of the clock toggle rate. In our target Virtex-4 FPGAs [1], the logic (CLBs), block RAMs, clock tree and all routing resources operate from the same voltage source (VCCINT = 1.2 V). The Input/Output blocks in our target FPGA operate at a higher voltage level (2.5 V).
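For reference, the dynamic component follows the standard first-order CMOS relation (a textbook expression, not an equation taken from this work):

    P_dyn ≈ Σ_i α_i × C_i × V² × f

where, for each net i, α_i is its transition activity relative to the clock, C_i its switched capacitance, V the supply voltage of its resource class and f the clock frequency. This is why the three factors listed above, together with the toggle rate, determine the dynamic power.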
For a stand alone router, we estimate the static and dynamic power contributed
by the design. Clock lines in FPGA design contribute to a significant portion of the
dynamic power consumed. We report the standalone power consumption of MoCReS in
Table 4.1. The dynamic power dissipated under 2.5V category is very marginal due to
the Router Functional Module (RFM) that was wrapped around the router for accurate
estimation of power. The RFM restricts the number of Input/Output blocks that will be
utilized by the router. In this section, we also obtain the router models of an alternate
design, LiPaR [11] and compare its power overheads with our modified multi-clock router
framework.
Even though the dynamic component of the standalone router is a small percentage
of the total power, for a large NoC topology implementation with several instances of the
router (and high switching activity), the dynamic component will also dominate the
total power. Table 4.2 presents the power dissipated in a 3 × 3 mesh implementation
of MoCReS (1VC+CC).
In this section, we compare the power consumed by our router architecture with an
alternate router design, LiPaR [11]. Our MoCReS router supports independent clock
frequencies as opposed to LiPaR [11], thereby resulting in increased performance. In
order to evaluate the power trade-offs involved in our design novelty, we compare the
two router designs keeping the following entities constant:
In addition to the above parameters, the operating frequency for the power experi-
ments was set at 100 MHz which corresponds to the critical path delay of LiPaR.
Standalone Version: Activity data for power estimation (.vcd) is obtained by simulat-
ing the post place & route model of the two designs (with random inputs). Xpower [1]
takes the design description (resource utilization) as a .ncd file along with the .vcd gen-
erated above. Table 4.3 compares the dynamic and quiescent power consumed by the
two approaches on the same target FPGA device. Due to fewer resources utilized in
MoCReS, there is a gain in static power and dynamic power contributed by the logic
resources. This is due to the logic optimizations in the cross-point matrix and central arbiter, which led to lower FPGA resource utilization. Due to the increase in the dynamic component, the power overhead of our approach is marginally (11.53%) more than that of the alternate design. It is important to note that the equivalent router for LiPaR is the basic version of MoCReS (1VC+CC), as LiPaR does not support virtual channels and operates on a single clock. Table 4.1 presents the power consumed by that version of
MoCReS.
Multi-Clock Feature: Our modified MoCReS architecture supports independent clock
frequencies for router instances, thereby allowing the router to function on the high-
est individual clock rate that the placement, routing and switch complexity constraints
dictate. Therefore the number of clock nets utilized could be higher and can cause
an increase in power consumed. With clock lines typically having higher fan-outs, the
switched capacitance and therefore the dynamic power associated with the clock nets
can be significant. Xilinx power estimation framework permits estimating the power
consumed by the clock lines independently for both the designs. It can be seen that
there is only a marginal 11.53% additional power overhead in our MoCReS approach compared to the alternate design.
To effectively characterize the FPGA NoC implementation for power, we determine the
dynamic and quiescent power of each major component in the NoC. The experimental
platform involves an incremental place and route of the NoC. During every stage we
retain all of its components and stimulate the part of the design under investigation. We
extract dynamic and quiescent power for identical activity rates.
In order to better understand the contributions to dynamic power of each of the
router components, we activated parts of the MoCReS design with random input vectors
(following a uniform distribution) and observed the dynamic power consumed. It can be seen that the buffers of the router consume the highest dynamic power. The results are in agreement with those extracted for ASICs [30]. It is to be noted that there are five instances of the Input Port component. We believe these results will be useful in developing future NoC designs targeted at FPGAs with optimized power consumption.
Stimulus is applied to the primary inputs. We use Modelsim 6.1c with TCL scripts
to extract the transition activity of internal nodes. The activity generated for the component alone is applied to it after an incremental place and route, with the remaining components' contribution eliminated. We use the technique developed by Arole [33] in his
power profiler. We exhaustively analyze power consumed in every resource of the NoC
by this methodology.
Resource Utilization and Dynamic Power: Based on the dynamic power consumed by every component and the resources it utilizes (extracted using ncd2xdl), we model the dynamic power across every resource. Table 4.5 compares the resource utilization of our router with that of the baseline design. The significant reduction in the number of nets and in the routing and logic resources used contributes to the gain in dynamic power.
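A minimal sketch of the per-resource roll-up such a model implies (illustrative C++; the resource names, the linear scaling with toggle rate and all numbers are assumptions, not measured data):

    #include <map>
    #include <string>

    // Per-resource-class usage of a component (e.g. extracted from the XDL
    // netlist) and an assumed per-instance dynamic power at a reference activity.
    struct ResourceUsage { int count; double mw_per_instance; };

    // Component estimate: weighted sum over resource classes, scaled by the
    // component's toggle rate relative to the reference activity.
    double estimate_dynamic_power_mw(
            const std::map<std::string, ResourceUsage>& usage,
            double relative_toggle_rate) {       // 1.0 = reference activity
        double total_mw = 0.0;
        for (const auto& entry : usage)
            total_mw += entry.second.count * entry.second.mw_per_instance;
        return total_mw * relative_toggle_rate;
    }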
4.4 Conclusions
In this chapter, we discuss the power consumed by our NoC framework on FPGA. We de-
termine the various power components in the standalone design and a 3 × 3 mesh NoC.
Further, to determine the power overhead incurred due to supporting the multi-clock
feature, we compare its power consumption with an alternate design that supports only
one frequency. Results show a marginal 11.53% increase in dynamic power compared to
the baseline approach. Further, we determine the power contributed by various compo-
nents of the router; the results show that the buffers consume the majority of the power in the NoC design. An account of the FPGA resources utilized by MoCReS, in comparison with the alternate design, is also presented in this chapter.
Chapter 5
5.1 Introduction
The two main concerns with NoC designs that are strictly packet-switched are the control
and serialization overheads involved in transferring data between IP cores that are placed
close to each other in the FPGA. In order to ensure high throughput between these cores,
we advocate time-multiplexed circuit-switched connections. In addition to this mode of
transfer, the router also preserves the online nature of communication between farther
cores through the packet-switched layer. The area efficient MoCReS architecture pre-
sented in Chapter 3 is modified to support both the above mentioned layers of operation.
The design goals and issues involved in the hybrid two-layer architecture are presented
in this chapter. We also develop a SystemC model of our router, both to functionally verify the design and to vary its specifications and rapidly obtain performance results through simulation. We present the results and analysis of the novel router
architecture in this chapter.
5.2 Motivation
Our hybrid two-layer router supports packet-switching for inter-router transfers and time-multiplexed circuit-switching for IP cores connected to the same router. This technique also eliminates the latency of the req/grant protocol, and the serialization and control overheads, for data transfers between cores placed close to each other in FPGAs and mapped to the same router.
In this section, we quantify the overheads associated with the existing baseline approach
(MoCReS). Control and Packetization are the two main overheads associated with the
MoCReS framework.
1. Control Overhead: In MoCReS, connections between the various ports are established through a req/grant protocol which involves round-robin arbitration in the case of common port requests (conflicts). From Chapter 3, we see that it takes at least 6 cycles for the data at an input port to appear at the output of a router (as input to the downstream router/local IP). This setup latency is a fixed overhead in addition to the actual data transfer time.
2. Packetization Overhead: Data sent over the network must be quantized into flits, and a variable number of flits constitutes a packet. If F is the number of flits in a packet and b is the channel width, the serialization time grows linearly with F for a fixed b; this overhead is incurred on every packet-switched transfer.
We target our proposed NoC framework for reconfigurable computing platforms and
therefore we restrict our discussions in this section primarily to existing FPGA based
NoCs. NoCs were introduced into the FPGA domain mainly to simplify tile-based recon-
figuration [4] [5], and its potential as an effective communication architecture is largely
unexplored [34]. Research in [10] [21] addresses the capabilities of FPGAs to support NoC based multi-processor applications. Hilton et al. [7] incorporate flexibility into their
design for FPGA based circuit-switched NoCs. However, their strictly circuit-switched
router suffers from signal integrity and path reservation issues which we overcome in our
design. SoCBUS [35] proposes a circuit-switched router with a packet based setup. Here,
control packets are responsible for setting up strict circuit-switched connections, which is
different from our two-layer approach. Research in [36] [7] [6] also presents FPGA based NoCs. These designs, however, ignore implementation-level area-performance trade-offs and their associated overhead.
In this section, we first present the modified router micro-architecture, followed by its
architectural advantages and design issues involved. The network topology along with the flow control for the packet-switched layer are kept the same as presented in Chapter 3 [37].
Figure 5.1: The two-layer hybrid router, with four directional ports (N, E, S, W) and four local IPs (IP0-IP3) attached through Network Interfaces; selected local IPs communicate over the circuit-switched layer, while all ports share the packet-switched layer.
Network Topology: Mesh networks have minimum area overhead (reduced long lines) [10] [37],
low power consumption and map well to the underlying routing structure of FPGAs.
Hence, we choose a mesh topology to optimize logic and routing in FPGAs, and to
provide sufficient resources for the IP cores.
Flow Control: Our router supports multi-clock virtual cut-through flow control with deadlock-free XY routing. The switch complexity involved in this choice is well suited to a light-weight implementation [37].
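As an illustration of the routing decision (a minimal sketch, not the actual RTL; the port names and coordinate conventions are assumptions), deadlock-free XY routing first corrects the X offset and only then the Y offset:

#include <cstdint>

enum class Port : uint8_t { Local, North, East, South, West };

// Deadlock-free XY routing: route along X until the column matches,
// then along Y until the row matches, then deliver locally.
Port xyRoute(int curX, int curY, int dstX, int dstY) {
    if (dstX > curX) return Port::East;
    if (dstX < curX) return Port::West;
    if (dstY > curY) return Port::North;
    if (dstY < curY) return Port::South;
    return Port::Local;   // packet has reached its destination router
}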
Circuit-Switched Layer: Only a subset of the local IP cores connected to the switch participate in the C-Layer, thereby achieving guaranteed throughput and more predictable latencies between IP cores placed close to each other in the FPGA.
Figure 5.1 presents the novel two-layer hybrid router architecture. This modified
router has four local IP ports, in addition to the four directional ports. Further, in
this case two of the four local IPs (IP 0,IP 3) are participating in the time-multiplexed
circuit-switched layer. Using the packet-switched layer, all four IPs can communicate with the rest of the network. Additional connections are introduced to support these additional local ports. However, all the connections
between the local ports in this layer are removed, as they are connected in the circuit-
switched layer. The ports connected through the C-Layer (IP 0,IP 3) cannot participate
in the P-Layer to transfer data between themselves. This translates into a gain in area. In the circuit-switched layer, a dedicated cross-point (C-CPM) connects the participating local IPs; this cross-point can handle a maximum of Pi high-throughput parallel connections. The scheduling memory configures this cross-point during the various time slots.
Router Channel Widths: Due to high throughput requirement between the cores
participating in the circuit-switched layer, we set the channel width to 32 bits (corresponding to the data width of the MicroBlaze soft processor). In the packet-switched layer, we retain the bus width of MoCReS (8 bits/channel). However, the choice of an appropriate channel width is a trade-off between the resources available and the bandwidth required.
Figure 5.2: Modified Central Arbiter, with per-port request/grant signals (Req_in/Grnt_in for L0, L3, N, E, S and W), round-robin FSMs and 3-bit multiplexer select outputs (Msel_L0, Msel_L3, Msel_N, Msel_E, Msel_S, Msel_W).
The Central Arbiter is responsible for configuring the simultaneous connections by setting
the cross-point in the P-Layer. We run parallel FSMs to ensure that no queuing takes place between requests. As long as the participating IPs request mutually exclusive ports, the connections are established in parallel. In case of queuing/conflicts, the arbitration is performed through a round-robin approach. The IPs that participate in the C-Layer will not need arbitration between themselves in the P-Layer. We perform state reduction in the FSMs corresponding to those inter-local port connections, i.e. in correspondence with the inter-local IP connections that are removed (Section 5.4.1) in the packet-switched layer.
The Central Arbiter is also customized to not support states for these connections. The
simplicity of round-robin arbitration coupled with the above state reduction translates
into significant area savings. Figure 5.2 shows the modified central arbiter model.
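For illustration, a minimal round-robin grant function (a sketch only; the actual arbiter is a set of parallel FSMs in VHDL, and the request encoding here is an assumption):

#include <cstdint>
#include <optional>

// Round-robin arbitration over N requesters: starting from the port after
// the last grant, return the first port with an active request.
template <int N>
std::optional<int> roundRobinGrant(const bool (&req)[N], int lastGrant) {
    for (int i = 1; i <= N; ++i) {
        int candidate = (lastGrant + i) % N;
        if (req[candidate]) return candidate;   // grant this port
    }
    return std::nullopt;                        // no pending requests
}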
5.4.3 NI Design
The network interface arbitrates the choice of packet/circuit switched layer and is also
responsible for supporting variable size packets.
Mode Switching: Upon receiving the target IP co-ordinates, the NI drives the mode signal to decide whether the packet will be decoded to leave the router through the P-Layer or the C-Layer cross-point will be triggered. The header flit of each packet carries:
1. Packet size (encoded as a fraction of the bRAM depth)
2. X co-ordinate of the destination IP
3. Y co-ordinate of the destination IP
The packets transferred through the network can be broadly classified as control (fewer flits) or data packets. Therefore, the packets will be of varied sizes. The NI encodes
the packet size as a fraction of the total bRAM depth along with the header. This novelty
improves buffer utilization, thereby increasing the performance of the NoC.
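A minimal sketch of such a header encoding (the field widths and bit layout here are assumptions, chosen only to illustrate carrying the destination co-ordinates and a size fraction in the first flit):

#include <cstdint>

// Pack a header flit: 2-bit X, 2-bit Y, and a 4-bit packet-size code
// expressed as a fraction of the bRAM depth (e.g. 0..15 sixteenths).
uint8_t packHeader(uint8_t dstX, uint8_t dstY, uint8_t sizeFraction) {
    return static_cast<uint8_t>(((dstX & 0x3) << 6) |
                                ((dstY & 0x3) << 4) |
                                 (sizeFraction & 0xF));
}

void unpackHeader(uint8_t flit, uint8_t &dstX, uint8_t &dstY, uint8_t &frac) {
    dstX = (flit >> 6) & 0x3;
    dstY = (flit >> 4) & 0x3;
    frac =  flit       & 0xF;
}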
In order to quickly explore the NoC design space, we have parameterized the structural VHDL model of the router over:
1. Number of ports
2. Channel width
3. Virtual channels/port
With fewer routers in the network, the average packet latency is also reduced [36]. Therefore, dynamic power drops considerably with the reduction in router hops.
Guaranteed Throughput: The time-multiplexed nature of the C-Layer scheduling provides good Quality of Service (QoS) to the application, particularly between cores placed close to each other. Otherwise, the NoC would have to support area-expensive QoS mechanisms in the packet-switched layer. Restricting the C-Layer to cores attached to the same router also optimizes the area required for storing the schedules (with fewer bits required to
encode the configuration data of the circuit-switched network).
Figure: Simulation waveform of the hybrid router, showing the setup latency on the packet-switched layer (P_CLK) and a multi-cast operation on the circuit-switched layer (C_CLK).
With increasing design complexities, there is a need for rapid design space exploration
that makes use of a set of specifications. We model our NoC router framework using
SystemC. By doing so, we functionally verify the model as well as setup a platform to
estimate the advantages of this architecture over the baseline approach.
SystemC is a description language that abstracts the computation elements of a design
by behaviors (or processes) and simplifies the communication between the cores using
transaction level modelling. The framework has a set of library routines and macros
implemented using C++. The behavior of the hardware to be modeled is captured by
simulating concurrent processes coded in C++.
SystemC Tool Flow: Every component in the router is modeled in C++ as a process.
This .cpp file can be compiled and executed with the SystemC engine that is written in
C++. We use the open-source SystemC version 2.1 to compile our router design. The set of .cpp files is first compiled with the appropriate command options; the resulting simulation executable is then:
• Applied to a standard simulation tool for verifying the functionality of the model.
Figure 5.4: Switch area (in slices) as a function of the number of circuit-switched and packet-switched ports per router.
In this section we present the Area/Synthesis results for our modified router implemented
on Xilinx Virtex 4 [1].
The additional bandwidth offered by the proposed router comes with an increase
in switch complexity. The amount of FPGA logic and routing resources consumed by
the router instance depends on its complexity. Figure 5.4 presents this variation in
switch area with the number of ports (C & P-Layer) it supports. Further, the operating frequency of the router instances varies greatly due to different critical path lengths.
Also, with increasing number of ports participating in the circuit-switched layer, the
routing resources deplete rapidly (due to increased channel widths). This degradation in
Figure 5.5: Switch operating frequency as a function of the number of circuit-switched and packet-switched ports per router.
Table 5.1: Scaling of Area and Frequency with No.of C-Layer Ports
MoClib Component Area (Slices) Frequency (MHz)
MC (4,2,2) 314 336
MC (5,3,2) 326 318
MC (5,2,3) 341 303
MC (6,3,3) 394 240
MC (6,2,4) 382 258
MC (7,3,4) 440 221
performance in turn affects the bandwidth the switch can offer. Figure 5.5 presents the
variation in switch operating frequency with the number of ports in both layers. The
above area and frequency estimates are obtained by varying the parameters in the VHDL
model of the router and by implementing them on the target device.
Furthermore, to perform automatic topology synthesis, we estimate the increase/decrease
in switch area with exclusive variations in number of P-Layer ports and C-Layer ports
independently. When NoC area is in the cost function, the above data will aid rapid
design space exploration. Tables 5.1 and 5.2 present the scaling of area & frequency
with increasing C-Layer and P-Layer ports respectively. In the tables, MC(x,y,z) denotes an instance of the MoClib library, where y is the total number of C-Layer ports, z is the total number of P-Layer ports and x is the sum of the two (total number of ports).
Table 5.2 presents the scaling of area and frequency only with respect to the P-Layer
ports and therefore they can be considered as variations of the MoCReS baseline router.
Table 5.2: Scaling of Area and Frequency with No.of P-Layer Ports
MoClib Component Area (Slices) Frequency (MHz)
MC (3,0,3) 296 378
MC (4,0,4) 318 362
MC (5,0,5) 349 324
MC (6,0,6) 390 296
MC (7,0,7) 435 267
MC (8,0,8) 493 229
The area of a router grows with the number of ports it supports. We measure the area values for an increasing number of ports (packet-switched) in the baseline version. For similar area values, when the alternate
hybrid router is used, there is an increase in available bandwidth per port. This band-
width increase associated with the hybrid router architecture is compared in this section
with the baseline approach. For equivalent area overheads (in slices) on a similar FPGA,
Figure 5.6 presents the bandwidth capacity (in MB/s) of the NoC (per port) for both
approaches. In spite of a rapid degradation in operating frequency (with increase in
circuit-switched ports), there is a significant bandwidth gain using the hybrid two-layer
approach. For the area window utilized in our library of routers, there is an average
20.4% gain in bandwidth (maximum of 24%) offered by our NoC. This gain in perfor-
mance is due to supporting a high throughput circuit-switched layer with a marginal
area overhead.
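As a rough, illustrative estimate (ours, not a figure taken from the tables; it ignores arbitration and flow-control stalls), the raw bandwidth a single port can sustain is the product of its channel width and operating frequency,

\[ BW_{port} \approx w \times f, \]

so an 8-bit P-Layer channel at 296 MHz (MC(6,0,6), Table 5.2) offers roughly 296 MB/s, while a 32-bit C-Layer channel at 240 MHz (MC(6,3,3), Table 5.1) offers roughly 960 MB/s, which is why the hybrid router gains bandwidth despite its lower operating frequency.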
Even though it appears intuitively that an increase in number of ports in the C-layer
gives performance benefits without any area overhead, there are certain design issues
that can potentially limit the performance due to increase in switch complexity.
Figure 5.6: Per-port bandwidth capacity (MB/s) of the NoC for the hybrid and baseline approaches, and the corresponding percentage gain in bandwidth, for routers of comparable area.
There is a rapid depletion of FPGA logic and routing resources associated with an increase in switch complexity (number of ports, bus width).
As a result, the operating frequency of the switch degrades which in turn affects the
bandwidth offered by the router. For the NoC paradigm to efficiently be an alterna-
tive to the bus-based architecture, the performance design parameters must be chosen
carefully so that it is possible to operate the routers at the highest possible frequency.
Switch Power vs Link Power: By increasing the number of ports, we can reduce
the average hop count [36], i.e we minimize the routers and links. This translates into
a reduction in power consumed by the links, but an increase in power consumed by the
switches. Beyond a cut-off, the increase in switch power can potentially overshadow the gain in link power, thereby increasing the power/flit ratio.
Explosion of Schedule Memory: With an increasing number of C-Layer ports, the schedule memory also scales linearly. The schedule memory, expressed in number of LUTs, is a function of the number of schedule cycles and the number of C-Layer ports present. If C is the number of ports participating in the C-Layer, then ⌈log2 C⌉ configuration bits are needed to encode each C-Layer connection in a schedule entry.
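Under the illustrative assumption that every schedule slot stores one ⌈log2 C⌉-bit destination index per participating C-Layer port (this exact organization is our assumption, not a measured figure), the schedule memory grows as

\[ M_{sched} \approx N_{slots} \times C \times \lceil \log_{2} C \rceil \ \text{bits}, \]

which is essentially linear in C for the small port counts used here, consistent with the scaling noted above.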
It can be seen that all of the above factors limit the amount of performance gain that
can be achieved using our hybrid approach. This trade-off between performance, area
and port count requires a careful balance and an application-specific tuning of the NoC topology. We present an algorithm along with a CAD flow in Chapter 8 to automate this tuning.
5.9 Conclusions
In this chapter, we present the limitations associated with the MoCReS packet switched
NoC and then design and implement a hybrid two-layer router architecture for FPGA
based NoCs. We functionally verify the design and characterize several versions of the
novel router for area and operating frequency. We also present the bandwidth results
along with the design advantages and issues involved in the proposed architecture.
Chapter 6
In this chapter we analyze the novel router architecture presented in Chapter 5 for
performance and power. Our MoCReS router design in Chapter 3 is utilized as the
baseline router in making the performance and power comparisons. We retain the Virtex-
4, XC4VLX100 [1] device as the target for all the comparisons.
The main advantage of the hybrid approach used in the router architecture is the increase in overall throughput it offers. The C-Layer connections are pre-scheduled between IP cores that require high bandwidth. These short-distance, high-bandwidth connections come with a low resource penalty in FPGAs. In this section, we quantify the average
improvement in performance compared to the baseline approach.
Figure 6.1: 3 × 2 mesh network of routers R1-R6 used for the performance comparison.
We have instantiated a 3 × 2 mesh network (Figure 6.1) with the baseline MoCReS
router. Furthermore, six packet injecting modules were wrapped around the mesh frame-
work. The framework adopted from [27] uses a C++ module which generates an input
file that contains all the packets (input vectors) that the network needs to transport.
Input parameters to the C++ program are, the mesh co-ordinates, number of packets
to be generated and the length of each packet. In every generated packet, the first flit
contains the destination IP X and Y co-ordinates and packet size (fraction). The VHDL
testbench wrapped around the mesh directly controls the injection rate at all IP input ports, using the packet files generated by the Perl tool. While generating the packets, it is ensured that the source and destination of a packet are never the same. The rest of the testbench reads the generated
text files and injects the packet into the source port. While doing so, the packet injection
timestamp is also recorded. When the same packet is received at the destination after
a finite number of clock cycles, the testbench also marks the time stamp. Therefore,
upon successful completion of simulation, the VHDL testbench creates two files for every
IP: one with the injection timestamp of every packet and the other with the received timestamp.
The experimental flow developed in [27] determines the total execution time, wherein
the largest timestamp of the received packet is reported. We modify the flow to compute
the latencies of every packet injected into the network and finally determine the average
latency of the baseline & modified network for that particular injection rate. We increase
the injection rate in steps of 0.1 flits/node/cycle and observe the increase in average
latency of the two networks.
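A minimal sketch of the post-processing step (the file names and format are assumptions; the real flow is the modified framework from [27]): read the injection and reception timestamps for each packet and average their differences.

#include <fstream>
#include <iostream>
#include <string>

// Average packet latency (in cycles) from two timestamp files:
// line i holds the injection / reception time of packet i.
double averageLatency(const std::string& injFile, const std::string& recvFile) {
    std::ifstream inj(injFile), recv(recvFile);
    double tInj = 0.0, tRecv = 0.0, sum = 0.0;
    long   count = 0;
    while (inj >> tInj && recv >> tRecv) {
        sum += (tRecv - tInj);
        ++count;
    }
    return count ? sum / count : 0.0;
}

int main() {
    std::cout << averageLatency("ip1_inject.txt", "ip1_receive.txt") << "\n";
}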
For both the baseline and hybrid router approaches, we apply a combination of two traffic scenarios (a sketch of the packet generator follows the list): random traffic and hot spot traffic.
• Random Traffic: The source-destination pairs and the number of packets are com-
pletely random values that follow a uniform distribution. Once generated, the same
traffic is applied for both the baseline MoCReS and hybrid mesh framework.
• Hot Spot Traffic: Along with the above random approach, we forcefully choose
source-destination pairs such that a majority of the transfers occur between one or
two IPs. We manually perform this step to create hot spots in the traffic. This could be a common scenario in an SoC, where critical components such as memories account for a significant share of the overall packet transfers.
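A minimal sketch of the random traffic generation described above (a C++ illustration under assumed packet-count and packet-size ranges, not the exact tool from [27]): uniformly draw source-destination pairs, rejecting pairs whose source equals the destination.

#include <cstdio>
#include <random>

// Emit 'numPackets' (src, dst, flits) triples for a mesh with 'numIPs' IPs,
// drawn from a uniform distribution and excluding src == dst.
void generateRandomTraffic(int numIPs, int numPackets, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> ipDist(0, numIPs - 1);
    std::uniform_int_distribution<int> lenDist(2, 16);   // assumed flit range
    for (int p = 0; p < numPackets; ++p) {
        int src = ipDist(gen), dst = ipDist(gen);
        while (dst == src) dst = ipDist(gen);            // never route back to the source
        std::printf("%d %d %d\n", src, dst, lenDist(gen));
    }
}

int main() { generateRandomTraffic(6, 100); }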
Figure 6.3 presents source-destination pairs for all the packets injected into the net-
work. As mentioned before, a packet is never routed back to the source IP. The figure
presents the number of packets sent into the five possible destinations for each source IP
(IP1 to IP6).
Figure 6.4: Average packet latency versus injection rate (flits/node/cycle) for the hybrid router mesh and the baseline MoCReS mesh.
Experimental results are presented in Figure 6.4. As the injection rate is increased from 0.1, the average latency in both cases initially increases linearly. However, it can be seen that the network saturates for the baseline approach around the 0.65 flits/node/cycle mark. With the hybrid router approach, the network sustains a linear average latency until 0.75 flits/node/cycle, which is a significant improvement for a small mesh network of 6 cores. This saturation point will vary considerably with the number of IPs (the dimension of the mesh). The gain observed for the hybrid approach is mainly due to:
1. Increased bandwidth between two pairs of cores that reduces the traffic burden
from the network.
2. The reduction in average number of packet hops due to reduced number of routers
compared to the baseline approach. The mapping of IPs has a significant impact on the total number of packet hops. Compared to the baseline case, the total number of packet hops is reduced by 42.8% in the hybrid mesh due to the routers that support multiple local IPs.
In the above analysis we have simplified the latency analysis by neglecting the zero
load setup time for a C-Layer connection. Our hybrid router incurs a 2 clock cycle penalty
for every C-layer schedule memory look up and cross-point connection. Furthermore, in
the above hybrid router mesh, at most two ports participate in the C-Layer. In
case of routers that have several C-layer ports with connection between C-layer ports
changing more frequently, this penalty needs to be included while comparing with the
baseline mesh.
This section presents the detailed power trade-offs involved in the proposed router ar-
chitecture. We first present the amount of total dynamic power consumed in our hybrid
router along with the power breakdown across its components. Further, we also present
how the above power metric scales with the number of C- and P- Layer ports present
in a router. Furthermore, power dissipated in our hybrid router mesh is compared with
the baseline router. Before determining the above power numbers, we floorplan the NoC
in our target FPGA and then place and route the design to obtain accurate resource
utilization estimates.
For this analysis, we consider an 8 port hybrid router (4 directional + 4 local) with
3 IPs participating in the C-Layer. The C-Layer components are the Schedule FSM,
bRAM Schedule memory and the C-Layer CPM (C-CPM). We apply random inputs
to the router and determine the switching activity of the placed and routed model.
Using XPower[1], we obtain dynamic and static power estimates for our router. Towards
obtaining a component based power estimate, we adopt the power profiler methodology
(with incremental synthesis) used in [33]. We apply 5000 flits with random requests
to the hybrid router and capture the applied transition activity in Table 6.1. The net dynamic power consumed by this router is 134.87 mW at a 200 MHz operating frequency.
Figure 6.5 presents the power breakdown of the 8 port hybrid router.
The power consumed per packet by each router varies with its complexity.
The complexity of the router translates into the amount of logic and routing resources
(switched capacitance) consumed by it. Figure 6.6 presents the scaling of dynamic power
(mW) with increasing P-Layer ports. With increasing P-Layer ports, the Cross-Point
and Central Arbiter scale linearly in terms of amount of logic utilized, thereby increasing
the dynamic power linearly.
Figure 6.6: Dynamic power (mW) at 200 MHz versus switch size, for 2 to 10 P-Layer ports.
With increasing C-Layer connections, the amount of long interconnects utilized within
the router increases, thereby increasing the amount of switched capacitance. For equivalent routers (in terms of the number of ports), more C-Layer connections can increase the dynamic
power consumed by up to 18%. Figure 6.7 presents dynamic power scaling in three cases
with varying number of directional ports. In this analysis we fixed the size of schedule
memory at 16 words with 8 bits/word. We have configured the schedule FSM to take as
input the source IP (2 bits), destination IP (2 bits) and the number of clock cycles (4
bits) in binary representation, thereby reducing the number of words required to store a
schedule. However, as the number of ports or configurations increase, the schedule mem-
ory needs to expand. In the above case the power fraction contributed by the schedule
memory will increase.
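For illustration, packing one 8-bit schedule word in the format just described (a sketch only; the bit ordering is an assumption):

#include <cstdint>

// One schedule word: 2-bit source IP, 2-bit destination IP,
// 4-bit number of clock cycles the connection is held.
uint8_t packScheduleWord(uint8_t srcIp, uint8_t dstIp, uint8_t cycles) {
    return static_cast<uint8_t>(((srcIp & 0x3) << 6) |
                                ((dstIp & 0x3) << 4) |
                                 (cycles & 0xF));
}
// A 16-entry schedule memory then occupies 16 x 8 = 128 bits.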
The two components of power consumed in an NoC are the switch power and link power.
The switch power is determined by the cross point size (number of ports), arbitration
and routing logic used. Fewer ports per router imply reduced switch complexity and an increased number of links.
Figure 6.7: Scaling of dynamic power (mW) at 200 MHz with varying numbers of C-Layer ports.
Figure 6.8: Total power (mW) consumed by the hybrid router mesh compared to the baseline MoCReS mesh.
In our router designs, we have employed simple XY
routing with round robin arbitration. This leads to a marginal scaling in switch power
with router ports. Further, with increasing interconnect sizes (farther cores), the link
power begins to dominate the overall power consumed in the NoC. In our hybrid router
architecture, we use C-layer connections only for IP cores placed close to each other. This
incurs a marginal penalty in link power compared to long bus based connections. We
present the impact of floorplanning on the power and performance of the on-chip network in greater depth in Chapter 9.
Figure 6.8 presents a comparison between the amount of total power consumed in the
hybrid two-layer router and the baseline version. The NoC supports 6 IPs with the same
network topologies in Figures 6.1 and 6.2. Due to reduced number of interconnects and
logic resources in the hybrid router, there is a gain of about 15.38% in dynamic power
and 10.1% in static power consumed.
Chapter 7
Experimental Platform
In this chapter we present the experimental platform developed to evaluate the ap-
proach presented in this thesis. First, a description of the multi-processor benchmarks used in our experiments is presented.
To determine the impact of our approach on typical designs, we utilize a set of real and
synthetic benchmarks that are widely used in NoC studies [36] [38]. The benchmarks are
represented by task graphs. We use the following graph model of communication:
• E(i, j) and BW(i, j) denote the directional edges (communication pattern) and the
corresponding bandwidth requirement between IP core i and j
These task graphs can represent many real applications, including MPEG, DFT etc.
They can also be easily generated for a comprehensive study of the proposed approach.
1. Real Application Benchmarks: This class of applications permits traffic characterization early in the design cycle. This information is used to fine-tune the NoC topology. Each of the edges is annotated with the
required bandwidth in Mega Bytes/second (MB/s).
2. Synthetic Benchmarks: In addition to these real application benchmarks, we obtain
a rich set of synthetic benchmarks that were generated using Task Graphs For Free
(TGFF) [40] in [36]. These benchmark cases sustain a rich variety of communication
properties, namely, in-degree, out-degree and dependence width and therefore represent a
wider class of multi-processor applications. For these synthetic benchmarks, we randomly
generate bandwidth requirements that follow a uniform distribution and use them to annotate the edges of the task graphs; the benchmark names reflect the parallel or packed nature of the benchmarks. The maximum bandwidth requirement denotes the opportunity for clustering (a critical-bandwidth violation), while the minimum bandwidth requirement represents the potential to reduce area (by increasing the number of packet-switched ports). The clustering and area reduction approaches are discussed in Chapter 8.
Benchmark          Cores  Edges  Max in-degree  Max out-degree  Max BW (MB/s)  Min BW (MB/s)
VOPD               12     14     2              2               500            16
MWD                12     13     2              2               128            64
LU Decomposition   9      11     2              3               510            76
Laplace Solver     9      12     2              2               378            68
Synthetic
Basic-1            9      8      1              4               196            34
Parallel-1         9      14     3              4               225            47
Packed-1           9      16     3              5               334            59
Packed-2           9      15     3              4               412            106
For the router design and implementation, we follow the Xilinx ISE [1] flow. The various
versions of the router are modeled in structural VHDL. We incorporate the design opti-
mizations using the VHDL models. The optimized model is then synthesized, mapped,
placed and routed using Xilinx ISE. The design is targeted for an XC4VLX100-11 on a Nallatech [3] BenDATA development board. Figure 7.1 presents a Virtex-4 [1]
platform FPGA on a BenDATA [3] development board. Xilinx XST [1] which is a part
of ISE 8.2i is used to synthesize the VHDL models. Xilinx LogiCORE FIFO Genera-
tor v2.3 is used to generate common clock/independent clock FIFO buffers. Functional
simulation of the router and mesh versions is performed using Modelsim 6.3c [28].
An NoC router implemented on FPGA consumes certain existing logic and routing re-
sources. Placement and routing constraints and switch complexity dictate the operating frequency of the router. To characterize this variation in router properties with the availability of resources, we develop an
experimental platform.
XDL Flow: Xilinx Design Language (XDL) is a standard, human-readable representation of the native circuit description (.ncd) of a placed and routed design. The XDL utility, which is a part of the ISE suite, is used in our research to convert the .ncd file of the router design to .xdl (using the xdl -ncd2xdl command). Later, this .xdl file is parsed to extract the resource
information. The XDL file contains two parts:
1. Resource Instances
2. Net Description
Resource Instances: This section of the XDL file contains instances of components uti-
lized in the target device, including LUTs, bRAMs, and embedded blocks if any.
Net Description: This part consists of a detailed description of every net instance in the design as implemented on the FPGA. The description includes the source pin, destination
pin(s), fan-out, and the routing resources utilized by the design. Figure 7.2 illustrates portions of a sample XDL file for our router.
Programmable routing resources in FPGA consume a significant portion of circuit
power & delay. An efficient NoC implementation will take into account these factors
of the underlying FPGA architecture. Routing resources in our target Virtex-4 FPGA
can be divided into: 1) Single (OMUX/IMUX) Lines, 2) Double Lines, 3) Hex lines, 4)
Long Lines, 5) Clock Tree, and 6) Programmable Interconnect Points (PIPs). Figure 7.3
illustrates the main types of routing resources in the target FPGA device.
The interconnect distance and switched capacitance of the above resources vary significantly, thereby accounting for a range of delay and power values. We estimate the delay (ns) and power (mW) overheads involved in utilizing these resources in the design.
cfg "
net "E_channel_data_in_3_IBUF" ,
outpin "E_channel_data_in <3>" ,
inpin "input_E/fifoA/BU2/U0/BU7"
,
pip BRAM_X10Y132 IMUX_B19_INT0 > RAMB16_DIA3
pip INT_X0Y134 BEST_LOGIC_OUTS0 > E6BEG8
pip INT_X10Y132 S2END8 > IMUXB19
pip INT_X10Y134 W2END6 > S2BEG8
pip INT_X12Y134 E6END6 > W2BEG6
pip INT_X6Y134 E6END8 > E6BEG6
pip IOIS_NC_L_X0Y134 IOIS_IO > BEST_LOGIC_OUTS0_INT ,
pip IOIS_NC_L_X0Y134 IOIS_IBUF0 > IOIS_IBUF_PINWIRE0 ,
pip IOIS_NC_L_X0Y134 IOIS_IBUF_PINWIRE0 > IOIS_I0 ,
;
Figure 7.3: Main types of routing resources in the target FPGA: single, double, hex and long lines.
Router Version Nets Hex Double Long Single Slices Max.Freq (MHz)
MoCReS (1VC+MC) 902 439 1402 5 5633 282 303
Table 7.2 summarizes the performance and power results of the interconnects. The performance
estimates are produced by re-routing a specific net across various resources using Xilinx
FPGA Editor [1]. For power measurements, the operating frequency is set at 200 MHz
with an operating voltage of 1.2 V, corresponding to the programmable interconnects in the XC4VLX100. The XPower utility is used to determine the dynamic power contributed by the particular net that is re-routed.
As a part of this research, we develop a Perl utility to parse the XDL file to obtain
the number of nets, and type of routing/logic resources utilized in the design. This
information is presented in Table 7.3. This version of MoCReS consumed 282 Virtex-4
slices and operated at 303 MHz. Figure 7.4 shows the percentage of each type of routing resource used by the design. It can be seen that the direct lines (IMUX/OMUX) contribute 75% of the total routing resources utilized by the design. The above design is tightly constrained in terms of area, and therefore the utilization of longer lines is low.
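A minimal sketch of such a resource counter (shown in C++ here purely for illustration, whereas our actual utility is written in Perl; the substring patterns used to classify line types are assumptions based on the wire names seen in the XDL sample):

#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Count nets and rough routing-resource classes in an XDL dump.
int main(int argc, char** argv) {
    std::ifstream xdl(argc > 1 ? argv[1] : "router.xdl");
    std::map<std::string, long> count;
    std::string line;
    while (std::getline(xdl, line)) {
        if (line.find("net \"") != std::string::npos) ++count["nets"];
        if (line.find("pip ") == std::string::npos) continue;
        if      (line.find("LONG") != std::string::npos) ++count["long"];
        else if (line.find("E6")   != std::string::npos ||
                 line.find("W6")   != std::string::npos) ++count["hex"];
        else if (line.find("E2")   != std::string::npos ||
                 line.find("W2")   != std::string::npos ||
                 line.find("N2")   != std::string::npos ||
                 line.find("S2")   != std::string::npos) ++count["double"];
        else                                             ++count["single/other"];
    }
    for (const auto& [kind, n] : count) std::cout << kind << ": " << n << "\n";
}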
Based on the placement and routing constraints present in the FPGA and the switch
complexity (number of ports, etc.), there can be several configurations for every router (based on the available FPGA resources). We consider three such configurations for a 5-port MoCReS router and present them below. We use the Xilinx PACE tool [1] to
floorplan and set the area constraints to the router design for every configuration.
Figure 7.4: Breakdown of the routing resources used by the MoCReS router: Single 75%, Double 19%, Hex 6%, Long <1%.
Configuration A: This router is closely packed with a square shape and with a max-
imum area constraint on it. Further, the FIFO bRAMs are placed close to the switch.
The above constraints sharply reduce the utilization of high-capacitance routing resources and achieve a router design with the maximum operating frequency (303 MHz). However,
traditional CAD tools avoid such a configuration due to excessive depletion of routing
resources in the constrained area. This leads to a low performance in the user logic that
surrounds the router. Figure 7.5 presents this heavily constrained configuration.
Configuration B: In some cases, user logic (IPs) might be prioritized for CLBs present
near bRAMs. In such cases, the network component (router) must be implemented
farther from the bRAM FIFOs. As a result, there is an increase in critical path in the
router design leading to lower operating frequency (286 MHz in this case). Figure 7.5
also presents this configuration of the router.
Configuration C: Finally, we capture the effect of inter router distances in FPGA
based NoC by means of this configuration. Due to varied core sizes, it is possible that
the network elements (routers) get placed and routed farther apart, thereby increasing
the delay between the output port of a router to the input buffer of the downstream
router. This increase will have an impact on the operating frequency of the upstream
router. In our experiments, the critical path increased by 0.802 ns when the inter-router distance was increased by 6 CLBs. This increase in critical path leads to a deterioration of the operating frequency of the router.
The NoC topology synthesis tool presented in Chapter 8 is implemented primarily using
C++ language and is supported by Perl scripts. Perl is used for benchmark processing
and mesh topology generation while the C++ tool executes the computationally intensive
operations. Data structures implemented in C++ use the Standard Template Library (STL). Edges in the task graph represent communication requirements between cores and are annotated in MB/s; we choose this simple model for its quick execution time, which allows us to perform many experiments.
Figure: Experimental design flow. VHDL hardware descriptions are synthesized, mapped and placed & routed with Xilinx ISE to produce the .ncd and .bit files for the platform FPGA; functional simulation (vsim) generates .vcd activity files that, together with the placed-and-routed design, drive the power analysis (xpower), while area/frequency models are extracted for the gcc-compiled synthesis tools.
Chapter 8
8.1 Introduction
Our router architecture can support a host of design parameters including, channel link
width (b), number of ports (Pi), number of virtual channels/port (v) and number of local cores participating in the circuit-switched layer (Li). Certain domains of applications provide static communication traces early in the design cycle, i.e. these classes of appli-
cations permit traffic characterization at an early stage. We utilize this information to
customize the NoC topology overlaid on FPGAs. Due to the large number of parameters, it is practically impossible to instantiate and hand-tune the communication topology for every application. In our approach, we break this cycle by manipulating the number of router instances and the
number of ports in each router. This chapter also presents the various phases of NoC
topology design in detail.
While ASIC implementations have a well developed CAD flow for NoC design [41] [42],
there is no automated methodology for FPGAs that takes into account the features of
the underlying architecture. Moreover, the limitation in resources, higher power con-
sumption, and increasing heterogenity of the FPGA device complicates the design flow.
Research in [5] is the first work to address automated design for FPGAs. Their underly-
ing NoC model enables fast performance verification and is less suitable for supporting
high-performance NoCs in FPGAs, as opposed to the characterized MoClib NoC library
that we have developed in this research.
To the best of our knowledge, this is the first work to propose an FPGA-suitable NoC topology synthesis flow that satisfies the bandwidth requirements of an application while optimizing its area
overhead.
Given a task graph G(V, E), where each vi ∈ V represents an IP core and each directed edge eij = (vi, vj) ∈ E denotes a communication edge with a bandwidth weight function bij : E → R, find a mapping G(V, E) → C(V, E), where C represents a mesh topology graph, such that ∀ i, j ∈ V the available topology bandwidth meets the required bandwidth Σ_{eij ∈ E} bij with minimum NoC area Σ_{i ∈ C} ℜi.
In this section, we present the four important phases in our flow, a detailed description
of the synthesis algorithm, its functioning and complexity.
The input to the algorithm consists of a core communication graph (G), annotated
with bandwidth requirement between modules. Further, the design space parameters,
namely, the maximum bandwidth supported in a packet-switched link (critical band-
width bc ), available FPGA area (Aav ) along with area and performance models of our
router architecture are also provided as input. Figure 8.2 shows our topology synthesis
framework. Our algorithm supports four main operations in the phases mentioned below:
1. Clustering
2. Mesh Generation
3. Candidate Topology Selection
4. Area Optimization
The above four phases are presented in detail with graphical examples.
Figure 8.2: Topology synthesis framework. Inputs: the MoClib library (.vhd models with area/performance models), the critical bandwidth BWcritical and Cmax; operations: clustering, exhaustive IP mapping with XY routing, link capacity and required bandwidth estimation, and NoC area optimization; output: the synthesized NoC topology.
Figure 8.3: Clustering example with BWcritical = 500 MB/s. Edges of the input task graph G whose required bandwidth exceeds the critical bandwidth (such as the 600 MB/s edge) cause the connected cores to be merged, yielding clustered nodes such as (1,3) and (2,5) in the clustered graph G'.
8.4.1 Clustering
During the clustering phase, the edges in the input task graph (G) whose required band-
width violates the critical bandwidth (bc ) are identified. The packet switched NoC frame-
work does not have sufficient link capacity to support these communications. Therefore,
we utilize the hybrid router architecture that has enough bandwidth available between
specific cores. The cores requiring these bandwidths that exceed the available inter-
router capacity are grouped to form clusters of multiple IPs connected via the C-layer of
a single router. Upon completion, this phase outputs the clustered core graph (G’), the
upper bound (U’) on the number of routers and the information about the types of net-
work components chosen from the MoClib library. In the new graph G', each cluster of cores appears as a single node attached to one hybrid router. Clustering more cores onto a router, however, increases its switch complexity and reduces the operating frequency of the router. Therefore, cores must be clustered judi-
ciously to avoid degrading overall performance. Table 8.1 presents the clustering results
from this phase for the chosen benchmarks.
During the Mesh Generation phase, we generate all mesh topologies with U’ routers. Due
to its suitability for FPGAs, we consider only mesh based topologies in this research.
From U’, we determine all its factors and identify all possible mesh topologies. During
this topology generation step, we preserve the clustered nature of the cores, output from
the previous phase. Of all these possible meshes, an appropriate topology that satisfies the bandwidth requirements is selected in the next phase.
During the Candidate Topology Selection phase, two operations are performed: the exhaustive IP mapping and the link bandwidth estimation. The selected candidate topology is then optimized for area during the last phase. If a valid mapping is present for that particular
topology, our exhaustive search algorithm is guaranteed to output the mapping, as the
bandwidth required for all (source,destination) pairs in the clustered core graph meets the
available topology bandwidth. Our choice of XY routing simplifies this phase in addition
to reducing the switch complexity due to its simple logic. The cumulative bandwidth
requirement on each edge (contributed by each source to destination route) establishes
the required link capacity constraint. For the MPEG4 application task graph, Figure 8.4 presents a candidate topology and its cumulative link bandwidth requirements (cost) in MB/s. As a result of Task 1 being mapped to router (0,0), the link connecting this router to (1,0) requires a bandwidth of 1912 MB/s (equal to the sum of the bandwidths between Task 1 and the rest of the tasks). To select a candidate topology, we conservatively estimate
the available link bandwidth that the communication architecture supports. This is
determined from the link width and the operating frequency of the router. During the
Link Bandwidth Estimation operation, the router models from the MoClib library (ℜ)
are input to estimate the available link capacities for the router configurations chosen in
the NoC topology by the above phases. This process involves estimating the bandwidth
available between routers operating at different frequencies. For instance, a bulky router present in a high-traffic communication path severely degrades the total performance of
the NoC. The candidate topology selection phase incorporates these trade-offs using the
router models.
The primary objective behind this algorithm is to synthesize NoC topologies for FPGA
based designs that satisfy the required bandwidth with a minimum area overhead. We
estimate this area overhead in terms of Virtex-4 [1] FPGA slices. The routers contribute
to the area utilization of an NoC. The area utilized by the router varies with its configu-
ration (number of C-layer & P-Layer ports, buffering and channel width). For a chosen
Figure 8.4: Candidate mesh topology for the MPEG4 benchmark, annotated with the cumulative link bandwidth requirements in MB/s (for example, 1912 MB/s on the link from router (0,0) to router (1,0)), together with the mapping of IP cores to the routers Router(x,y).
topology, which carries information on the configurations of the routers used, we determine the total area by summing the individual router areas (in slices) looked up from the MoClib library of NoC components. To summarize, upon determining the candidate topology that satisfies
the bandwidth, we conservatively estimate the area required by the chosen topology.
During the Area Optimization phase, the required number of router components (U’)
is decreased iteratively by one. In each iteration, we prune the NoC topology by removing
a router with a single IP connected to it. In order to balance the total number of IPs
with the local ports of the routers, we perform the following in order:
• Re-attach the orphaned IP to a neighbouring router's local port, which causes only a very marginal area increase compared to the gain obtained by removing one router.
In the area optimization phase, we first perform the above operation and determine
if the bandwidth requirements are still met. However, if the above step introduces
violations for all combinations of mesh and IP mappings, we substitute that chosen
router with an alternate configuration that supports additional C-Layer ports instead. Reducing the number of routers in this fashion also minimizes the average hop count of the network, leading to improved execution time. However, our primary objective is
only to ensure that the bandwidth requirements are met with a minimum area NoC. The
new NoC topology is then input back to the Mesh Generation and Candidate Topology
Selection phases to determine (exhaustively) if the bandwidth requirements are met.
The four phases described in Section 8.4 are presented in the form of pseudocode in the following algorithm. The input to the algorithm consists of the core task graph G(V, E), with |V| cores and |E| edges, annotated with bandwidth values. The MoClib component library and the critical bandwidth (bc) are also provided as input.
The Clustering phase (lines 1-5) involves iterating over all the edges to determine the
bandwidth violations. The output of this operation is the clustered core graph, G’(V,E)
and the upper bound on the number of routers (U’). Based on the factors of this upper
bound, all possible mesh topologies are generated (lines 6-7) and output to perform
candidate topology selection (lines 8-13). As mentioned in Section 8.4.3, this operation
can be partitioned into two sub-operations: IP Mapping (lines 8-9) and Link Bandwidth
Estimation (lines 11-13). Finally, optimizing area (lines 15-21) involves decrementing U' and determining if the new topology satisfies the required bandwidth. The terminating conditions for the iterations are U' = 1, or when Pmax, the maximum number of ports in the routers of the suggested NoC topology, exceeds Pcritical (the maximum number of ports supported by the library).
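A high-level sketch of the synthesis loop (a structural outline only, with placeholder types and trivially stubbed phase functions; it does not reproduce the exact pseudocode or its line numbering):

#include <optional>
#include <vector>

struct TaskGraph { /* clustered core graph G' (placeholder) */ };
struct Topology  { int maxPorts = 0; /* mesh + IP mapping (placeholder) */ };

// Placeholder phase implementations; the real tool performs these steps
// on the benchmark task graphs and the MoClib area/frequency models.
TaskGraph cluster(const TaskGraph& g, double /*bwCritical*/)        { return g; }
int       upperBoundRouters(const TaskGraph&)                       { return 6; }
std::vector<Topology> generateMeshes(int /*numRouters*/)            { return {Topology{}}; }
bool      bandwidthSatisfied(const Topology&, const TaskGraph&)     { return true; }

std::optional<Topology> synthesize(const TaskGraph& g, double bwCritical, int pCritical) {
    TaskGraph gc = cluster(g, bwCritical);                  // Phase 1: clustering
    std::optional<Topology> best;
    for (int u = upperBoundRouters(gc); u >= 1; --u) {      // Phase 4: shrink router count
        bool found = false;
        for (const Topology& t : generateMeshes(u)) {       // Phase 2: mesh generation
            if (t.maxPorts > pCritical) continue;           // library port limit
            if (bandwidthSatisfied(t, gc)) {                // Phase 3: candidate selection
                best = t; found = true; break;              // keep the first valid mapping
            }
        }
        if (!found) break;   // further area reduction would violate the bandwidth
    }
    return best;             // minimum-area topology that meets the bandwidth
}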
We analyze the time complexity of our algorithm in this section and present the execution
time results for a set of chosen benchmarks. With respect to the type of computation
performed, the algorithm presented in Section 8.5 can be divided into the following
phases,
1. Clustering
2. Mesh Generation
3. Candidate Topology Selection
4. Area Optimization
The Clustering phase presents a time complexity of O(|E|), where |E| is the number of edges in the task graph. For the design sizes considered in this research, this phase contributes only a negligible portion of the total execution time. Based on the determined router upper bound U', the next phase, Mesh Generation, outputs all possible mesh configurations. This step is linear in the number of vertices |V|, and therefore has a time complexity of the order of O(|V|). During the Candidate Topology Selection phase, the operations performed are the exhaustive IP mapping and the link bandwidth estimation. The worst-case time complexity of these two operations can be expressed as (O(U'!) + O(|E|)) × (number of mesh configurations). Finally, the Area Optimization phase also has a time complexity of the order of O(|V|). Of the
above four phases, the candidate topology selection phase dominates the computational
complexity of the algorithm due to its exhaustive nature. Even though the IP mapping
design space is factorial, we exit early from the exhaustive search once the first valid mapping is found, i.e. we do not optimize the CAD flow for performance. As a result,
it will be shown in Section 8.6 that the typical execution times are much less compared
to the worst case complexity.
The router models are described in VHDL and we use Xilinx ISE 8.2i [1] to follow the FPGA design flow for them. Algorithm 6.1 is implemented in C++ using Standard Template Library (STL) data structures and is supported by Perl scripts for benchmark processing. We execute the algorithm on our chosen benchmarks on an AMD Opteron processor running Linux at 2.4 GHz with 3 GB of RAM, and report the results in Table 8.3.
The execution time of the algorithm for a benchmark is directly related to the time
complexity of the mesh generation and IP mapping phase. With an exception of one
benchmark (FFT, with 15 cores), the average execution time was around 8 minutes.
91
12 11 1 2 3 9 5 3 3 1 2
10
6 5 7 8 6 1 7 6 4 5
5 8
4 10 11 8 9
9 8 4 7 4 2
3 1 2
10 12 13 14 15
12 11 10
7 6
11 12
In order to determine the impact of the proposed algorithm on area, we compare our
results in this chapter with the solution provided by the baseline NoC described in
Chapter 3. This traditional multi-clock NoC has one IP attached to every router and supports only packet-switched communication. We evaluate our technique on four widely used application benchmarks (FFT, MPEG4, VOPD and MWD) [39] and six synthetic benchmarks [36] that represent a variety of communication patterns that are frequently encountered in multi-processor designs.
Using our hybrid architecture and integrated design flow, results were obtained for
various benchmarks. For similar bandwidth constraints applied through task graph edges,
Figure 8.6 compares the synthesized topology area between the proposed and baseline
approaches. With the number of cores in the benchmarks varying between 6 and 15, it
can be seen that there is an average reduction of 21.6% (maximum of 26%) in the NoC
area, which can instead be used for the application logic. The same bandwidth constraints were applied to the baseline design, and the area of both solutions was estimated in slices. It is to be noted that the CAD tool does not optimize the design for execution time; ensuring that the required bandwidth is satisfied is the primary goal. For
all of the application benchmarks, our approach was able to obtain alternate topologies
utilizing our hybrid router library with fewer FPGA resources.
8.8 Conclusions
In this chapter, we automate the NoC design cycle by integrating our hybrid router library with an algorithm that optimizes for FPGA area while
satisfying the required bandwidth. Experimental results for a set of real applications and
synthetic benchmarks show an average reduction of 21.6% in FPGA area (maximum of
26%) for equivalent bandwidth constraints when compared with a baseline approach.
Figure 8.6: Synthesized NoC area (in slices) for the hybrid router and the baseline (MoCReS) approaches, and the resulting percentage area savings, for the MPEG4, FFT, VOPD, MWD, LU, Laplace, Basic-1, Parallel-1, Packed-1 and Packed-2 benchmarks.
Chapter 9
In this chapter we present the design methodology that we have developed for implement-
ing a complete SoC application using our on-chip network backbone. The methodology consists of the following components:
• IP Core Characterization
• NoC Floorplanning
Upon presenting the motivation behind our methodology, we will then address our
contribution to each of the parts listed above in greater depth. In this chapter, we also present a case study comprising a multi-processor Image Compression Applica-
tion, wherein we obtain real VHDL cores and perform comparisons between alternate
implementations in our target FPGA.
9.1 Motivation
With increasing device capacities and design sizes, ITRS 2007 [43] advocates high design
re-use as the solution paradigm. As shown in Figure 9.1, the percentage of the whole
design that is constructed from re-used IPs is expected to increase steadily in the next
several years. Figure 9.1 also presents the increasing future trend for the percentage
of reconfigurable components in future designs. These trends, along with increasing time-to-market constraints, motivate the standardization of IP cores in FPGA based designs.
Among the different kinds of IPs, we restrict ourselves to soft IP cores in this discussion. We assume the soft
IPs to have the following properties:
• Data transfers to/from the IP block take place through a finite number of buses
with large data widths (32 bits for example).
Xilinx MicroBlaze [1] offers soft, configurable processor cores for software implementation of parts of the design. This feature adds tremendous flexibility to applications implemented in FPGAs. The alternative hard processors available in recent FPGA devices are PowerPC cores; these offer very high performance but are limited in number. The soft MicroBlaze IPs, as opposed to the embedded processors, must be implemented in the configurable logic of the FPGA, thereby competing for resources with the other parts of the design. However, as opposed to the hard embedded processors, tens of these MicroBlaze cores can be implemented using present FPGA devices. Figure 9.2 presents the architecture of the MicroBlaze IP. Some of the main features of soft IPs in Xilinx FPGAs can be classified into computation-based features and communication-based features:
Computation Based:
Communication Based:
• Flexible and efficient Processor Local Bus (PLB) or On-chip Peripheral Bus (OPB) interfaces
• Up to 16 Fast Simplex Links (FSLs), each 32 bits wide, for interfacing external modules
With the above logic and interface support, MicroBlaze IPs can be efficiently implemented in a multi-processor SoC. The standardized interfaces of the IP readily support the NoC paradigm for communication. In particular, the 16 FSL links available to interconnect external co-processors or other computation modules can serve as the interface for
nect external co-processors or other computation modules can serve as the interface for
the NoC.
We have designed the On-Chip network and the network interface keeping in mind
these communication requirements. For example, multiple FSL links (of size 32 bits)
emerging from the MicroBlaze cores could be interfaced to the network backbone using our multi-module customizable NI. Certain IP communications might be less time-
critical. In those cases, the data could be packetized and transmitted over the P-Layer
of the router, thereby achieving tremendous parallelism in the applications. On the
other hand, time critical data communication requiring predictable latencies could be
pre-scheduled through the C-layer of the NoC.
In addition to the above soft cores, Xilinx CORE Generator [1] provides a rich set of IP cores optimized for Xilinx FPGAs. The IPs provided span audio, video and image processing, automotive applications and FPGA-specific storage elements. However, Xilinx does not automatically synthesize a suitable communication architecture for these IPs. We advocate our NoC based framework for this IP based design environment. In the next section we obtain a set of freely available soft IP cores [44], customize them for our NoC-NI framework, and present the overheads involved.
We obtain a set of publicly available cores [44] and develop an IP library that will be
compatible with our Network Interface and NoC. These application IPs can serve as
individual cores of a wide variety of multi-processor applications.
Later in this chapter we present the area characterization of the library of IPs we
consider in this study. As mentioned before, each of the IPs will be wrapped by a
suitable instance of the Network Interface that will be described in the next section.
Further, in subsequent sections we study the area and power overheads incurred by our NI and NoC framework.
Figure: Generic soft IP core abstraction: a computation unit exchanging data over data/control buses, wrapped by the NI, with up to 4 IP module connections.
The modified two-layer router architecture presented in Chapter 5 supports high through-
put intra router connections (C-Layer) in addition to the packet switched online routing
layer (P-Layer). This hybrid router sustains a high average bandwidth per port. The role of the Network Interface (NI) is to standardize the external communication of the IP core, thereby hiding the implementation details of the interconnect. In this section, we first present the design goals behind this network interface and then describe its compatibility with our IP abstraction. This work was carried out in collaboration with another student; see his thesis [45] for further details.
As mentioned above, data transfers to/from the core take place through a finite number of buses. Towards designing a customizable Network Interface for this generic core abstraction, we keep an upper bound of 4 on the number of entry/exit points of the IP (called IP Modules). The main design goals of the NI are:
• Hybrid router compatibility
• Customizability
• Low area
Hybrid router compatibility: Being compatible with our library of hybrid two-
layer routers was our most important design goal. The NI must be able to support data transfers between a variable number of IP cores through the Circuit-Switched Layer as well as the Packet-Switched Layer. If required, the NI needs to resolve operating-frequency differences between the communicating IPs.
Customizability: The RTL description of the NI needs to support certain important design parameters. These parameters allow seamless integration of the NI with a library of IP cores. The main parameters of the NI are (a sketch of such a parameter set follows the list):
• FIFO Depths
• Configuration Modes
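As an illustration only (the parameter names and default values below are assumptions, not the actual RTL generics), such a parameter set could be captured as:

#include <cstdint>

// Hypothetical compile-time configuration of one NI instance.
struct NiConfig {
    uint32_t fifoDepth       = 16;    // buffering per IP module (FIFO depth)
    uint32_t numIpModules    = 2;     // entry/exit points, bounded by 4
    uint32_t channelWidth    = 32;    // bits, e.g. to match an FSL link
    bool     useCircuitLayer = true;  // configuration mode: C-Layer enabled or P-Layer only
};

constexpr bool isValid(const NiConfig& c) {
    return c.numIpModules >= 1 && c.numIpModules <= 4;   // NI upper bound of 4 IP modules
}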
Figure: Triple DES IP core, with 64-bit data and key inputs, wrapped by the customized NI.
The IP cores presented in the previous sections need to be interfaced with our NoC. For doing so, we customize the NI to suit the particular requirements of every IP. Upon preparing the IPs for our NoC framework, we characterize them for their area and power overheads. The results are presented in Figure 9.5. It can be seen that the average area increase due to the NI overhead is 18.82%, while there is an average 10.37% increase in dynamic power.
Figure 9.5: Area and dynamic power of the characterized IP cores with and without the NI wrapper (for example, the Quantizer grows from 1175 to 1367 slices, a 16.34% area increase, and from 33.08 mW to 37.81 mW of dynamic power, a 14.30% increase).
Xilinx PlanAhead is a floorplanning tool available as a part of the existing Xilinx design flow for FPGA implementation. The objective behind this tool is to support hierarchical place and route (from the synthesized design), thereby improving the overall performance of the design.
Advantages: 1) this CAD enhancement can be used to identify and move critical blocks in the design; 2) it supports direct place and route based on pin and memory constraints. Power optimization, on the other hand, has hardly featured as a priority in the traditional FPGA design flow. However, the current design complexities and the increasing need for portable hand-held applications are motivating
power aware CAD enhancements to the design flow.
Until now, floorplanning in the design flow (PlanAhead) has emphasized only manual area constraints placed on the individual cores; the design then takes a long time to converge, which we believe becomes almost infeasible as the number of cores increases. Figure 9.7
presents the area constraints applied for a 3 × 2 mesh NoC. Upon completion of place
and route, the resource congestion of the mesh is shown in Figure 9.8.
In this research we have pre-characterized the network components and obtained
delay and power models for its components. When this communication architecture
knowledge is used to floorplan the multi-processor application and NoC, it could lead to
rapid timing-violation removal and design convergence. Furthermore, while proposing a CAD enhancement, it needs to be ensured that the existing industry-level design flow for FPGA implementation is only marginally altered.
In our router architecture, buffering is performed only at the input ports. Therefore, the inter-router links appear in the critical path of the design. Also, with increasing link lengths and distances between routers (increased switched capacitance), the link power tends to dominate the overall power consumption. While implementing an NoC with predictable power-performance metrics, the above factors need to be considered. We present the routing resource
characterization results in Figure 9.9. Xilinx FPGA Editor [1] is used to vary the routing
between various points in the NoC and the delay variations are measured in ns and
dynamic power is measured at a 200 MHz clock frequency.
It can be seen that every inter-router connection (NoC link) spanning 4 CLBs could add up to 0.5 ns (2 × 0.25 ns) of delay. As the link delay appears in the critical path, a router operating at 250 MHz could suffer a performance degradation of up to 12.5%,
which is significant. Through efficient floorplanning and choice of appropriate frequencies
a router. When the number of ports is increased, the performance of the NoC could theoretically increase while keeping the area overhead low. However, the floorplan perspective presented above needs to be considered while estimating the actual benefits of routers that support an increased number of IPs.
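For clarity, the 12.5% figure quoted above follows directly from the clock period at 250 MHz; the lines below simply restate that arithmetic and add no new measurement.

    T_clk = 1 / (250 MHz) = 4 ns
    additional link delay = 2 × 0.25 ns = 0.5 ns
    worst-case degradation = 0.5 ns / 4 ns = 12.5%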
Image processing applications are in general computation intensive due to the need to process several million pixels when operating at reasonable resolutions. Further, these applications have portability and constrained time-to-market requirements and therefore merit FPGAs as the target architecture. NoC based multi-media communication architectures are expected to be highly competitive alternatives to their existing bus-based counterparts. The main reason for this is the natural division of these applications into
designs, the freely available cores that we utilize to construct our experiments serve to
significantly reduce the design and test time. As the multi-processor domain of digital
design continues to evolve, more sophisticated applications could be implemented using
our NoC framework for FPGAs.
Application Description: The name JPEG stands for Joint Photographic Experts Group. It specifies the way in which an input image is transformed/compressed and stored. The JPEG conversion is performed through a set of IPs that convert the raw input data into the JPEG standard. It proceeds through multiple stages: RGB to YCrCb conversion, DCT, Huffman encoding and Run Length Encoding (RLE), with the result finally stored in memory. Traditionally, a color conversion from the RGB space to the YCrCb space is performed before storage and transmission. This leads to a drastic bandwidth reduction while sustaining only a marginal quality trade-off. The application is designed to accept 352 × 288 pixel images in bitmap format as input and produces output in JPEG format.
Figure 9.10 presents the JPEG application that we have implemented in this case study.
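For reference, the color conversion commonly used in JPEG (the JFIF/BT.601 full-range form) is reproduced below; the exact coefficients implemented by the OpenCores IP may differ slightly.

    Y  =  0.299 R + 0.587 G + 0.114 B
    Cb = -0.1687 R - 0.3313 G + 0.500 B + 128
    Cr =  0.500 R - 0.4187 G - 0.0813 B + 128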
The Network Interface described in the previous section is customized for implementing
the IPs of the image compression application. The objective behind this experiment is to
implement a typical SoC application on an FPGA along with the network components
and characterize the NoC design for area, performance and power. Present industry
applications operating at very high frequencies and having various highly complex pro-
cessing units certainly merit these sophisticated communication architectures. However,
as we could not obtain such industry-level benchmarks for performing our experiments,
we restrict ourselves to freely available cores from Opencores [44].
The network channel width is set at 16 bits per flit. We assume each packet to carry one block of 8 × 8 pixels, each represented using 16 bits. Therefore, an entire packet consists of 64 flits of 16 bits each, so that a single packet carries one block independently. To study the area and power overheads of the NoC in a
typical application scenario, we implement various NoC topology versions. Figure 9.11
presents the various configurations that we have synthesized in order to study the NoC
overheads. We have retained a mesh topology for all the experiments as we already have
a library of router components that supports its flow control. Furthermore, all the multi-
port routers utilized in implementing this application sustain only the P-Layer. We make
this decision keeping in mind the marginal bandwidth requirements of this application.
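For reference, the packet sizing above follows directly from the block dimensions:

    bits per block = 8 × 8 pixels × 16 bits/pixel = 1024 bits
    flits per packet = 1024 bits / 16 bits per flit = 64 flits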
In this section we present detailed area and power overheads of the NoC design. Fig-
ure 9.12 presents the comparison between these alternate implementations. The four
configurations shown in this figure are the same as those shown in Figure 9.11. The
last column presents results from a flat synthesis implementation of the whole applica-
tion. Upon enforcing tight timing constraints, the maximum operating frequency of the
[Figure: NoC configurations (a) and (b) with the RGB to YCrCb, DCT and Run Length Encoder cores connected to routers.]
with the flat synthesis approach as the amount of parallelism inherently present in the
application is very marginal. Our experimental results mainly serve to demonstrate the trade-offs in area and power of the NoC that arise when choosing an appropriate implementation methodology.
Area and Power Overhead: Within the limited available logic and routing resources, the FPGA needs to accommodate the user logic along with the communication archi-
tecture. During all phases of our design we have accomplished the design goals with as few resources as possible. Routers, being the central component of the network, are replicated multiple times to interface all the IPs to the network. In this research we determine the area overhead of the NoC as a percentage of the total slices utilized by the application and the NoC. To accurately estimate the logic, we: 1) floorplan the design manually, 2) place large constraints on the area and 3) obtain the place and route results. Figure 9.12
shows that the overall area utilization of the image compression application gradually
increases with the number of routers. Also, the percentage overhead of the NoC increases and reaches 11.23% for the 8-router topology (configuration (d)). As the number of cores and the amount of parallelism scales, the performance and power benefits obtained through this approach are expected to outweigh the area overhead seen above. Figure 9.12 also presents power
overheads for various configurations. For the purposes of comparing power consumption
between alternate implementations, we consider only the dynamic component. It can
be seen that with variations in the number of routers, dynamic power follows a different
trend compared to the area overhead. Compared with configuration (a), the two-router
version sustains lower dynamic power due to reduced switch power. Even though there
is a marginal increase in area, the reduced switching activity in the routers leads to
lower dynamic power. The average power overhead in the on-chip network across all
configurations was around 18% for the JPEG application.
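As a rough cross-check, the approximate NoC slice count and the quoted average power overhead can be recovered from the data of Figure 9.12 (reproduced below); these are derived values, not additional measurements.

    NoC slices in configuration (d) ≈ 0.1123 × 10743 ≈ 1206 slices
    average % NoC dynamic power ≈ (17.74 + 16.86 + 17.28 + 19.11) / 4 ≈ 17.7% ≈ 18%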
Configuration            (a)        (b)        (c)        (d)        Flat Synthesis
Total Area (Slices)      9947       10158      10336      10743      9546
Total Dyn. Power (mW)    337.19     322.42     335.64     367.85     274.77
% NoC Dyn. Power         17.74      16.86      17.28      19.11      -
NoC/Design FMax          108 MHz    186 MHz    242 MHz    273 MHz    86 MHz
In the following subsections, we briefly summarize the contributions made in this dissertation. Furthermore, we suggest directions for future work beyond these contributions.
10.1 Contributions
Systems was developed. The design addresses area, performance and multi-clock ca-
pability, which are the primary design goals for NoCs on FPGAs. Our 5-port
virtual cut-through router has an area overhead of only 282 Virtex-4 slices (a marginal
0.57% of XC4VLX100) and operates at 357 MHz supporting a competitive data rate of
2.85 Gbit/s.
We determine the power consumption of the NoC framework on the FPGA. The various components of the consumed power were presented in detail. Further, we analyze the power
trade-offs associated with our design novelties by comparing it with a baseline approach,
implemented on the same target device. Results show a marginal dynamic power over-
head (11.5%) for the performance advantage observed in our multi-clock NoC design.
Further, we associate the power consumed by various components in the NoC architec-
ture to the underlying FPGA resources utilized by them.
We thoroughly characterize the novel router architecture and its library of network components for the power consumed.
A CAD tool has been developed for implementing the design flow for FPGA NoC Topol-
ogy Synthesis. It implements an exhaustive search algorithm with multiple phases that
optimizes the area of the NoC while meeting the bandwidth requirements of the applica-
tion. The MoClib NoC component library is used to perform the NoC topology design space exploration. For any specific application given as a task graph, our integrated
synthesis framework determines a suitable NoC topology that satisfies the bandwidth
requirements, while optimizing for the area overhead. We report the results for a wide
set of application and synthetic benchmarks, represented as task graphs. Results show
an average reduction of 21.6% in FPGA NoC area (maximum of 26%) for equivalent
bandwidth constraints when compared with a baseline approach.
NoC-compatible IPs were utilized to implement an image compression application using
FPGA-based NoCs. We evaluate the area and power overheads involved in our alternate
implementation methodology.
In this section we outline future research directions that could serve as extensions of our work.
FPGA CAD Enhancement: As the number of processing elements within a SoC
increases, the design time tends to increase due to the manual effort required. Therefore,
it is important to automate floorplanning of the on-chip network and IPs, taking into
consideration the constraints of the application. Real IP cores have their independent
pin, bRAM and logic/routing resource requirements. We have modeled the resource,
performance and power overheads of our NoC and IP library in our research. An effi-
cient CAD flow that takes into consideration these resource requirements needs to be
developed.
Multi-Processor Benchmarking: Standardization of multi-processor benchmarks
would greatly help future research in enhancing FPGA based NoCs. It is certain that the
semiconductor industry is driving towards more and more processing elements. However,
it is important to benchmark SoC applications that have high inherent parallelism in them. This would model the realistic traffic scenarios and congestion that are typical of FPGA implementations. Furthermore, similar to bus-based architectures, IP cores developed in the future could be standardized for NoC-centric communication.
FPGA Device Support: In this research we have considered NoCs suitable for
implementing on FPGAs. Overlaying on-chip networks on an existing FPGA device
offers tremendous flexibility at the cost of some performance. Future heterogeneous FPGAs could incorporate hardware support for on-chip networks and thereby increase the overall performance. Such devices could also partition the NoC components over hard embedded and soft configurable FPGA blocks.
Bibliography
[5] A. Kumar et al. An FPGA Design Flow for Reconfigurable Network-Based Multi-Processor Systems on Chip. In DATE'07, 2007.
[7] Clint Hilton and Brent Nelson. PNoC: a flexible circuit-switched NoC for FPGA
based systems. In IEEE Proc. Computers and Digital Techniques, 2006.
[10] Manuel Saldaña, Lesley Shannon, and Paul Chow. The Routability of Multiprocessor
Network Topologies in FPGAs. In SLIP’06, pages 49–56, 2006.
[11] Balasubramanian Sethuraman, Prasun Bhattacharya, Jawad Khan, and Ranga Ve-
muri. LiPaR: A Light-Weight Parallel Router for FPGA-based Networks-on-Chip.
In Great Lakes Symposium on VLSI, 2005.
[12] William J. Dally and Brian Towles. Route Packets, Not Wires: On-Chip Interconnection Networks. In Design Automation Conference (DAC), 2001.
[13] Luca Benini and Giovanni De Micheli. Network on Chips: A New SOC Paradigm.
In IEEE Computer, 2002.
[16] J. A. Kahle et al. Introduction to the Cell multiprocessor. In IBM Journal of Research
and Development, 2005.
[21] T. A. Bartic et al. Topology Adaptive Network-on-Chip Design and Implementation. In Computers and Digital Techniques, IEE Proceedings, pages 467–472, 2005.
[24] Fernando Moraes et al. A Low Area Overhead Packet-switched Network on Chip:
Architecture and Prototyping. In IFIP VLSI-SOC, 2003.
[26] Daewook Kim, Manho Kim, and Gerald E.T Sobelman. Asynchronous FIFO Inter-
faces for GALS On-Chip Switched Networks. In Intl. SoC Design Conference’2005,
pages 186–189, 2005.
[27] Vijay Swaminathan. Performance Analysis of Multi-Clock NoC for FPGAs. Master's
thesis, University of Cincinnati, 2007.
[29] Prasun Bhattacharya. Comparison of Single-Port and Multi-Port NoCs with Con-
temporary Buses on FPGAs. Master’s thesis, University of Cincinnati, 2006.
[30] N. Banerjee, P. Vellanki, and K. S. Chatha. A Power and Performance Model for
Network-on-Chip Architectures. In DATE 04: Proceedings of the conference on
Design, automation and test in Europe, 2004.
[32] Mário P. Véstias and Horácio C. Neto. Co-Synthesis of a Configurable SoC Platform
based on a Network on Chip Architecture. In ASPDAC, 2006.
[33] Alukayode Arole. Power Profiling: An Incremental Power Analysis Technique for FPGA-Based Designs. Master's thesis, University of Cincinnati, 2006.
[34] T. S. T. Mak et al. On-FPGA Communication Architectures and Design Factors. In
FPL’06, 2006.
[35] D. Wiklund and D. Liu. SoCBUS: switched network on chip for hard real time embedded systems. In Parallel and Distributed Processing Symposium, 2003.
[36] Balasubramanian Sethuraman and Ranga Vemuri. optiMap: a tool for automated generation of NoC architectures using multi-port routers for FPGAs. In Design, Automation and Test in Europe (DATE'06), 2006.
[37] A. Janarthanan et al. MoCReS: an Area-Efficient Multi-Clock On-Chip Network for Reconfigurable Systems.
[38] T. Lei and S. Kumar. A two-step Genetic Algorithm for Mapping Task Graphs to a Network on Chip Architecture. In Euromicro Symposium on Digital System Design, 2003.
[40] R. P. Dick et al. TGFF: Task Graphs for Free. In 6th International Workshop on
Hardware/Software Codesign, 1998.
[41] Davide Bertozzi et al. NoC Synthesis Flow for Customized Domain Specific Multiprocessor Systems-on-Chip. In IEEE Transactions on Parallel and Distributed Systems, 2005.
[42] K. Srinivasan and K. Chatha. A low complexity heuristic for design of custom network-on-chip architectures. In DATE'06, 2006.
[44] http://www.opencores.org.
[46] R.Gindin, I.Cidon, and I.Keidar. NoC-Based FPGA: Architecture and Routing. In
NOCS 2007, 2007.
[48] N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen. A Virtual Channel Router for On-Chip
Networks. In SOC Conference, 2004.
[49] N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen. Two Architectures for On-Chip Virtual
Channel Router. In PROGRESS Symposium on Embedded Systems, 2004.
[51] C. A. Zeferino, M. E. Kreutz, and A. A. Susin. RASoC: a router soft-core for networks-on-chip. In DATE'04 Designer's Forum, IEEE CS Press, pages 198–203, 2004.
[52] William J. Dally and Brian Towles. Principles and Practices of Interconnection
Networks. Morgan Kaufmann, 2003.
[53] K. Srinivasan and K. Chatha. A technique for low energy mapping and routing in network-on-chip architectures. In ISLPED, 2005.
[56] S. Murali and G. De Micheli. SUNMAP: A Tool for Automatic Topology Selection and Generation for NoCs. In Proceedings of the ACM/IEEE Design Automation Conference, 2004.
[60] R. Gindin, I. Cidon, and I. Keidar. NoC-Based FPGA: Architecture and Routing. In International Symposium on Networks-on-Chip, 2007.
[61] A. Mello et al. Virtual Channels in Networks on Chip: Implementation and Evaluation on the Hermes NoC. In SBCCI'05, 2005.
[62] I. Kuon and J. Rose. Measuring the Gap Between FPGAs and ASICs. In FPGA'06,
2006.
[64] A. Reimer, A. Schulz, and W. Nebel. Modelling Macromodules for High-Level Dynamic Power Estimation of FPGA-based Digital Designs. In ISLPED'06, 2006.
[66] J. Xu, W. Wolf, J. Henkel, and S. Chakradhar. A design methodology for application-
specific networks-on-chip. In ACM Transactions on Embedded Computing Systems,
2006.
[67] T. Bartic et al. Network-on-Chip for Reconfigurable Systems: From High-Level De-
sign Down to Implementation. In FPL’04, 2004.
[69] P. Véstias and H. Neto. Area and Performance Optimization of a Generic Network-
on-Chip Architecture. In SBCCI’06, 2006.
[72] U. Y. Ogras and R. Marculescu. It's a Small World After All: NoC Performance
Optimization Via Long-Range Link Insertion. In IEEE Transactions on VLSI, 2006.
[74] A. Kumar et al. Express Virtual Channels: Towards the Ideal Interconnection
Fabric. In ISCA’07, 2007.
[75] E. Rijpkema et al. Trade offs in the design of a router with both guaranteed and
best-effort services for networks on chip. In DATE’03, 2003.
[76] S. Stergiou et al. A synthesis oriented design library for networks on chips. In
DATE’05, 2005.
[77] S. Murali et al. Bandwidth Constrained Mapping of Cores onto NoC Architectures.
In DATE’04, 2004.
[78] H. S. Wang et al. Orion: A Power-Performance Simulator for Interconnection Net-
works. In Microarchitecture’02, 2002.
[79] U. Y. Ogras, J. Hu, and R. Marculescu. Key research problems in NoC design: a
holistic perspective. In International Workshop on Hardware/Software Codesign,
2005.
[82] M. Wang, A. Ranjan, and S. Raje. Multi-Million Gate FPGA Physical Design Challenges.
In ICCAD’03, 2003.
[83] J. Liu, L. Zheng, and H. Tenhunen. Global Routing for Multicast-Supporting TDM
Network-on-Chip. In SOC’04, 2004.
[84] S. Murali et al. Designing Application Specific Networks on Chips with Floorplan
Information. In ICCAD’06, 2006.
[85] K. Poon, A. Yan, and J. E. Wilton. A Flexible Power Model for FPGAs. In FPL'02,
2002.
[87] D. Wu, B. M. Al-Hashimi, and M. T. Schmitz. Improving Routing Efficiency for Network-
on-Chip through Content-Aware Input Selection. In ASPDAC’06, 2006.
[88] Tong Li. Estimation of Power Consumption in Wormhole Routed Networks on Chip.
Master’s thesis, IMIT/LECS Stockholm, Sweden, 2005.
[89] K. Paulsson, M. Hübner, and J. Becker. Online Optimization of FPGA Power Dissipa-
tion by Exploiting Runtime Adaptation of Communication Primitives. In SBCCI’06,
2006.
[93] A. Laffely, J. Liang, P. Jain, W. Burleson, and R. Tessier. Adaptive systems on a chip
(aSoC) for low-power signal processing. In IEEE Signals, Systems and Computers,
2001.
[95] Z. Lu, M. Liu, and A. Jantsch. Layered Switching for Networks on Chip. In DAC'07, 2007.