(Tutorial) NoC: The Next Generation of Multi-Processor SoC
Network-on-Chip: The Next Generation of Multi-Processor System-on-Chip
Presenters
Dept. of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur.
18th Feb, 2011
Lecture 1
Introduction
[Figure: end nodes, each consisting of a device with SW and HW interfaces, connected by links to a shared communication medium]
After mass-market production of dual-core and quad-core processor chips, the trend towards multi-core processing is now well established. In multi-core processing, multiple processors (e.g., CPUs, DSPs) along with multiple computer components (e.g., microcontrollers, memory blocks, timers) are integrated onto a single silicon chip. This architecture is often called a Multi-Processor System-on-Chip (MPSoC).
Introduction
System-on-Chip (SoC)
Each on-chip component is referred to as an Intellectual Property (IP) block. The communication medium used in modern multi-processor chips is bus based. Up to tens of cores in a single chip, the performance of these bus-based chips is satisfactory, but beyond that performance degrades with the number of cores attached.
Reference: International Technology Roadmap for Semiconductors (ITRS) Documents (2003), Available at: http://public.itrs.net/Files/2003ITRS/Home2003.htm.
Segmented Bus
The shared global bus is segmented by inserting repeaters (R). In a segmented bus, delay increases linearly with decreasing process technology. There is no improvement in bandwidth, as it is still shared by all the cores attached to it. At the system level, this has a profound effect in changing the focus from computation to communication.
Advantage: Bandwidth is higher than the shared bus. Drawbacks: Switch size increases with the number of cores. The number of links needed increases exponentially as the number of cores increases. More metal layers are required in placement and routing.
Crossbar switch and point-to-point links. Advantage: A crossbar switch enhances scalability to some extent. Drawback: However, connecting a large number of cores with a single switch is not very effective, as it is not ultimately scalable; thus, it is an intermediate solution.
o Off-chip networks have higher latency than their on-chip counterparts. o Area is not a strong constraint for off-chip networks, but for on-chip networks it is one of the major constraints.
Reference: Benini, L. and De Micheli, G. (2002) 'Networks on chips: a new SoC paradigm', IEEE Computer, Vol. 35, No. 1, pp. 70-78.
NoC
Aggregate bandwidth grows with system size. Speed is unaffected by the number of cores N. Distributed arbitration. Separate abstraction layers. However: a more complex architecture.
NoC
SoC
High throughput. Low latency. Scalable architecture. Less energy consumption. Smaller area requirements. Reliability in communication. Quality-of-Service support.
Switching Techniques
Circuit Switching
Buffers for request tokens
Request for circuit establishment (routing and arbitration are performed during this step)
Circuit Switching
Buffers for ack tokens
Acknowledgment and circuit establishment (as the token travels back to the source, connections are established)
Circuit Switching
Source end node. Request for circuit establishment. Acknowledgment and circuit establishment. Message transport (neither routing nor arbitration is required).
Circuit Switching
X
Source end node. Acknowledgment and circuit establishment. Packet transport. High contention, low utilization, low throughput. Destination end node.
Switching Techniques
Store-and-forward Packet switching
Packets are completely stored before any portion is forwarded Buffers for data packets
Store
Switching Techniques
Store-and-forward Packet switching
Packets are completely stored before any portion is forwarded Latency per router depends on the size of the packet
Requirement: buffers must be sized to hold entire packet
Store Forward
Drawbacks: 1. Larger buffers. 2. More latency.
Switching Techniques
Virtual Cut-Through Packet Switching
Packets completely stored at the switch
Requirement: buffers must be sized to hold entire packet
Drawback:
Busy Link
Latency per router is reduced by forwarding the header flit of a packet as soon as there is space for the entire packet in the next router.
Switching Techniques
Wormhole Packet Switching
Advantage: Lower buffer space, lower latency. Drawback: Throughput lower than virtual cut-through. Requirement: packets can be larger than buffers.
Busy Link
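The contrast among the three switching techniques can be made concrete with first-order, zero-load latency models. This is an illustrative sketch only (assuming H hops, a packet of L flits, and one flit per cycle per channel), not the cycle-accurate simulator used later in the tutorial:

```python
# First-order, contention-free latency models (in cycles) for the switching
# techniques above; hops = H, packet_flits = L, one flit moves per cycle.

def saf_latency(hops, packet_flits):
    # Store-and-forward: each router receives the whole packet before forwarding.
    return hops * packet_flits

def cut_through_latency(hops, packet_flits):
    # Virtual cut-through / wormhole (no contention): the header pipelines
    # through the routers while the body streams behind it.
    return hops + packet_flits

# A 64-flit packet over 4 hops: 256 cycles store-and-forward vs 68 cut-through.
assert saf_latency(4, 64) == 256
assert cut_through_latency(4, 64) == 68
```

Under contention the techniques diverge further: wormhole blocks in place with small buffers, while virtual cut-through absorbs whole packets at intermediate routers.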
Flitization
Header flit (32-bit): bop, eop, GT/BE, Src_add, Dest_add
Payload flits 1 … n (32-bit each): bop, eop, GT/BE, DATA 1 … DATA n
Tail flit (32-bit): bop, eop, GT/BE
Deflitization: reassembly of the (64 x 32)-bit packet
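As a rough sketch of flitization and deflitization, the following splits a packet into flits tagged with begin-of-packet (bop) and end-of-packet (eop) flags. The GT/BE bit and the address fields of the slide's header format are omitted here for brevity, so this is a simplification under stated assumptions rather than the exact encoding:

```python
# Hypothetical flitization sketch: each flit is a (bop, eop, data) tuple.

def flitize(payload_words):
    """Split a packet (list of 32-bit words) into flits with bop/eop flags."""
    flits = []
    n = len(payload_words)
    for i, word in enumerate(payload_words):
        bop = 1 if i == 0 else 0          # set on the header flit only
        eop = 1 if i == n - 1 else 0      # set on the tail flit only
        flits.append((bop, eop, word))
    return flits

def deflitize(flits):
    """Reassemble the packet payload; assumes flits arrive in order."""
    assert flits[0][0] == 1 and flits[-1][1] == 1
    return [data for (_, _, data) in flits]

packet = list(range(64))                  # a 64-word packet, as on the slide
assert deflitize(flitize(packet)) == packet
```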
Topology Selection
Diameter: The maximum shortest-path distance between any two nodes in the network. Networks with small diameters are preferable.
Number of Links: A topology with a large number of links can support high bandwidth.
Average Distance: The average of the distances between all pairs of nodes of the graph. A topology with a smaller average distance is preferable.
Bisection Width: The minimum number of wires that must be removed in order to bisect the network. A larger bisection width enables faster information exchange and is preferable.
Node Degree: The number of channels connecting a node to its neighbors. The lower this number, the easier it is to build the network.
Reference: Interconnection Network Architectures (2001) pp.2649, Available at: www.wellesley.edu/cs/ courses/cs331/notes/notesnetworks.pdf
All switches are connected to the four closest switches and to the target resource block via two opposite unidirectional links, except those switches on the edge of the layout.
For an M x N mesh: Diameter: (M + N - 2). Bisection Width: min(M, N). Number of routers required: (M x N). Node Degree: 3 (corner), 4 (edge), 5 (central). CLICHE: Chip-Level Integration of Communicating Heterogeneous Elements.
Reference: Kumar, S., Jantsch, A., Soininen, J. P., Forsell, M., Millberg, M., Oberg, J., Tiensyrja, K. and Hemani, A. (2002) A network on chip architecture and design methodology, Proc. of. ISVLSI, pp.117124.
Wires are wrapped around from the top component to the bottom and from the rightmost to the leftmost. For an M x N torus:
Diameter: M/2 + N/2. Bisection Width: 2 x min(M, N). Number of routers required: (M x N). Node Degree: 5.
Disadvantage: the long end-around connections can yield excessive delays.
Reference: Dally, W. J. and Towles, B. (2001) Route packets, not wires: on-chip interconnection networks, Proceedings of the 38th Design Automation Conference (DAC 2001), pp.684689.
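The mesh and torus formulas quoted above are easy to sanity-check in code. This small sketch simply transcribes them (using floor division for the torus diameter, an assumption consistent with even M and N):

```python
# Closed-form topology metrics for M x N mesh and torus, as on the slides.

def mesh_metrics(m, n):
    return {
        "diameter": m + n - 2,
        "bisection_width": min(m, n),
        "routers": m * n,
    }

def torus_metrics(m, n):
    return {
        "diameter": m // 2 + n // 2,   # wrap-around links halve each dimension
        "bisection_width": 2 * min(m, n),
        "routers": m * n,
    }

assert mesh_metrics(4, 4) == {"diameter": 6, "bisection_width": 4, "routers": 16}
assert torus_metrics(4, 4) == {"diameter": 4, "bisection_width": 8, "routers": 16}
```

For the same node count, the torus halves the diameter and doubles the bisection width at the cost of the long wrap-around wires noted above.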
Reference: Dally, W.J. and Seitz, C.L. (1986) The torus routing chip, Journal of Distributed Computing, Vol. 1, No. 4, pp.187196.
2D Octagon of 8 cores
Reference: Karim, F., Nguyen, A. and Dey, S. (2002) An interconnect architecture for networking systems on chips, IEEE Micro, Vol. 22, No. 5, pp.3645.
A binary tree-based network with N (a power of 2) IP cores has: Diameter: log2 N. Bisection Width: 1. Number of routers required: (N/2 - 1). Node Degree: 5 (leaf), 3 (stem), 2 (root).
Reference: Jeang, Y. L., Huang, W. H. and Fang, W. F. (2004) A binary tree architecture for application specific network on chip (ASNOC) design, IEEE Asia-Pacific Conference on Circuits and Systems, pp.877880.
Every level has the same number of switches. The functional IP blocks reside at the leaves and the switches at the vertices. For N IP blocks, the network has: Diameter: log2 N/4. Bisection Width: N/2. Number of routers required: (N log2 N)/8. Node Degree: 8 (non-root node), 4 (root node).
In the network, the IPs are placed at the leaves and the switches at the vertices. For N IPs, the network has:
Diameter: log2 N/4. Bisection Width: N. Number of routers needed: (N/2). Node Degree: 6 (non-root), 4 (root).
Advantages: requires a smaller number of switches; low diameter and large bisection width.
Drawback: high node degree.
Reference: Pande, P. P., Grecu, C., Ivanov, A. and Saleh, R. (2003), High-throughput switch-based interconnect for future SoCs, Proc. Intl Workshop on System-on-Chip for Real Time Applications, pp.304310.
Mesh-of-Tree Topology
- In an M x N MoT, M denotes the number of row trees and N the number of column trees; both M and N are powers of 2.
- Number of nodes = 3*M*N - (M + N).
- Small diameter: (2 log2 M + 2 log2 N).
- Large bisection width: min(M, N).
Drawback - Non-planar topology.
Reference: Kundu, S. and Chattopadhyay, S. (2008), Mesh-of-Tree Deterministic Routing for Networkon-Chip Architecture, ACM Great Lake Symposium on VLSI, pp. 343346.
Routing
Source routing: The routing control unit in the switches is simplified, since the route is computed at the source. Headers carrying the route tend to be larger, which increases overhead.
Distributed routing: The next hop is computed at each switch by a finite-state machine or by a look-up table.
Deterministic routing: Always follows a specified path. Easy to implement and supports in-order delivery.
Adaptive routing: Takes different paths based on congestion and faults, which destroys in-order delivery. Uses historical channel-load information, length of queues, and status of nodes and links.
Routing Challenges
Live-lock in Adaptive Routing
Livelock: Arises from an unbounded number of allowed non-minimal hops. Solution: restrict the number of non-minimal hops allowed.
Deadlock: Arises from a set of packets being blocked, each waiting only for network resources (i.e., links, buffers) held by other packets in the set. Its probability increases with increased traffic and decreased resource availability.
Deterministic Routing
Adaptive Routing
North-Last
Negative-First
Reference: Glass, C. J. and Ni, L. M. (1992), Turn Model for Adaptive Routing, Proceedings of International Symposium on Computer Architecture, pp. 278 287.
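The turn models above restrict an adaptive router; the deterministic baseline they are usually measured against is XY (dimension-order) routing. A minimal sketch, assuming integer tile coordinates on a 2D mesh:

```python
# Deterministic XY routing: travel fully in the X dimension, then in Y.
# Forbidding Y-to-X turns in this way removes all dependency cycles.

def xy_route(src, dst):
    """Return the hop-by-hop output directions from src=(x, y) to dst."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:
        step = 1 if dx > x else -1
        x += step
        hops.append("E" if step == 1 else "W")
    while y != dy:
        step = 1 if dy > y else -1
        y += step
        hops.append("N" if step == 1 else "S")
    return hops

assert xy_route((0, 0), (2, 1)) == ["E", "E", "N"]
```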
Rule 1. A packet is not allowed to take an EN turn or an ES turn at any node located in an even column.
Rule 2. A packet is not allowed to take an NW turn or an SW turn at any node located in an odd column.
Reference: Chiu, G. M. (2000), The Odd-Even Turn Model for Adaptive Routing, IEEE Transactions on Parallel and Distributed Systems, pp. 729 738.
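The two Odd-Even rules can be expressed as a simple turn-legality check. This is only a sketch of the rule test itself, not a full Odd-Even router:

```python
# Odd-Even turn model: legality of a turn depends on the column parity of the
# node where the turn is taken (Rule 1 for even columns, Rule 2 for odd).

def turn_allowed(turn, column):
    even = (column % 2 == 0)
    if turn in ("EN", "ES") and even:
        return False        # Rule 1: no EN/ES turns in even columns
    if turn in ("NW", "SW") and not even:
        return False        # Rule 2: no NW/SW turns in odd columns
    return True

assert not turn_allowed("EN", 2)   # even column: EN forbidden
assert turn_allowed("EN", 3)       # odd column: EN allowed
assert not turn_allowed("NW", 3)   # odd column: NW forbidden
```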
[Figure: four-node example n0-n3 with a cyclic channel dependency (labels: n0n2, n1n3, n2n0, n3n1), removed by splitting each physical channel into virtual channels]
Reference: Dally, W. J. and Seitz, C. L., (1987) Deadlock Free Message Routing in Multiprocessor Interconnection Networks, IEEE Transactions on Computers, vol. C-36, no. 5, pp. 547 553.
Deadlock Recovery
Allow deadlock to occur, but once a potential deadlock situation is detected, break at least one of the cyclic dependencies to recover gracefully. The common techniques are:
Regressive recovery (abort-and-retry): Remove packet(s) from a dependency cycle by killing (aborting) and later reinjecting (retrying) the packet(s) into the network after some delay.
Progressive recovery (preemptive): Remove packet(s) from a dependency cycle by rerouting the packet(s) onto a deadlock-free lane.
Reference: Kundu, S. and Chattopadhyay, S. (2007) Interfacing Cores and Routers in Network-on-Chip Using GALS, IEEE International Symposium on Integrated Circuits (ISIC 2007), pp.
Reference: Yi, Cheng, Gray code sequences, U. S. Patent 6703950, March 9, 2004.
Reference: Cummings, C. E. and Alfke, P. (2002) Simulation and Synthesis Techniques for Asynchronous FIFO Design with Asynchronous Pointer Comparisons, Synopsys Users Group Conference, vol. User Papers.
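Gray-coded pointers are what make the cross-domain pointer comparison safe: adjacent counts differ in exactly one bit, so a pointer sampled mid-transition in the other clock domain is off by at most one, never wildly wrong. A sketch of the standard conversions:

```python
# Binary <-> Gray conversions used for asynchronous-FIFO read/write pointers.

def bin_to_gray(n):
    return n ^ (n >> 1)

def gray_to_bin(g):
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Property check: successive Gray codes have Hamming distance exactly 1.
for i in range(15):
    diff = bin_to_gray(i) ^ bin_to_gray(i + 1)
    assert diff != 0 and (diff & (diff - 1)) == 0   # exactly one bit set

assert gray_to_bin(bin_to_gray(41)) == 41
```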
Full = ( (waddr == raddr) && (wr_dir != rd_dir) ) Empty = ( (waddr == raddr) && (wr_dir == rd_dir) )
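The full/empty conditions above can be modeled with an extra "direction" (wrap) bit carried alongside each pointer: equal addresses mean full if the pointers have wrapped a different number of times, empty otherwise. A behavioral sketch (the DEPTH value is an illustrative assumption):

```python
# Behavioral model of the FIFO full/empty test above: pointers wrap modulo
# DEPTH and the direction bit toggles on each wrap.

DEPTH = 8

class FifoPointers:
    def __init__(self):
        self.waddr = self.raddr = 0
        self.wr_dir = self.rd_dir = 0

    def push(self):
        assert not self.full()
        self.waddr += 1
        if self.waddr == DEPTH:
            self.waddr, self.wr_dir = 0, self.wr_dir ^ 1

    def pop(self):
        assert not self.empty()
        self.raddr += 1
        if self.raddr == DEPTH:
            self.raddr, self.rd_dir = 0, self.rd_dir ^ 1

    def full(self):
        return self.waddr == self.raddr and self.wr_dir != self.rd_dir

    def empty(self):
        return self.waddr == self.raddr and self.wr_dir == self.rd_dir

f = FifoPointers()
assert f.empty() and not f.full()
for _ in range(DEPTH):
    f.push()
assert f.full()
```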
Metastability
The Full and Empty signals are controlled by both clocks; thus there is a probability of metastable states arising. Two-stage synchronizers are used to reduce the probability of metastability. The Full signal is synchronized with wr_clk, and the Empty signal is synchronized with rd_clk.
Arbitration
Router Architecture
Reference: Kundu, S. and Chattopadhyay, S. (2008) Network-on-chip architecture design based on Mesh-of-Tree deterministic routing topology, Intl Journal of High Performance Systems Architecture, Vol. 1, No. 3, pp. 163-182.
[Figure: router datapath; physical channels feed input buffers (IB), a crossbar (ST) driven by crossbar control, and output buffers (OB); the critical path runs through the input-buffering stage]
[Figure: router architecture; link controllers on the physical channels feed input buffers (IB); a route-computation unit (RC, implementing the routing algorithm) and a switch-arbitration unit (SA, selecting the output port) drive the crossbar control; packets traverse the crossbar (ST) to the output buffers (OB). Pipeline stages per head flit: IB, RC, SA, ST, OB; subsequent flits repeat IB, ST, OB.]
Performance Evaluation
Performance Metrics
Throughput: TP = (Maximum Accepted Packets x Packet Length) / (Number of IP blocks x Total Time). Unit: flits/cycle/IP.
Latency: The time (in clock cycles) that elapses between the injection of a message header into the network at the source node and the reception of the tail flit at the destination node. Averaged over P received packets: Lavg = (Sum_i Li) / P.
Bandwidth: Bandwidth refers to the maximum number of bits that can be sent successfully to the destination through the network per second. It is expressed in bps (bits/sec).
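The two formulas above translate directly from simulation counts; a minimal sketch:

```python
# Computing the performance metrics defined above from simulation counts.

def throughput(accepted_packets, packet_len_flits, num_ips, total_cycles):
    """TP in flits/cycle/IP."""
    return (accepted_packets * packet_len_flits) / (num_ips * total_cycles)

def average_latency(per_packet_latencies):
    """Lavg: mean of per-packet latencies L_i over the P received packets."""
    return sum(per_packet_latencies) / len(per_packet_latencies)

# E.g., 1000 accepted 64-flit packets on 32 IPs over 200,000 cycles.
assert abs(throughput(1000, 64, 32, 200_000) - 0.01) < 1e-12
assert average_latency([60, 70, 80]) == 70
```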
Cost Metrics
Energy dissipation: Energy consumed by routers and links at different workloads. Average energy/packet and average energy/clock cycle are measured.
Area requirements: The percentage of chip area occupied by the switches and links is taken into consideration.
To calculate performance metrics like throughput and latency, the delay of each and every gate is not required. In that case, a cycle-accurate simulator is the best choice.
Drawbacks
Limited to mesh topology. No power evaluation. Not freely available. Packet-level transactions.
components of the router. SystemC is normally preferred. Traffic generators are used for evaluating the performance of the NoC.
[Figure: router components; input channel (input buffer, routing computation unit, control unit) and output channel (output buffer, arbiter, control unit)]
Traffic Generation: Poisson distribution, self-similar traffic, application-specific traffic. Network metrics: 1. Throughput, 2. Latency, 3. Bandwidth.
Traffic Generator
Application-driven traffic is best suited for performance evaluation. Due to its unavailability, synthetic traffic source models are also used. The nature of traffic in a NoC is generally bursty.
A Poisson process
When observed on a fine time scale, it will appear bursty. The burst length of a Poisson arrival process tends to be smoothed by averaging over a long enough time scale. Poisson processes thus fail to capture the actual burstiness of NoC traffic: they exhibit only short-range dependence.
Reference: Varatkar, G.V. and Marculescu, R. (2004) On-chip traffic modeling and synthesis for MPEG-2 video applications, IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 12, No. 1, pp. 108-119.
Traffic Generator
A Self-Similar (fractal) process
When aggregated over a wide range of time scales, it maintains its bursty characteristic. Self-similarity manifests itself in several equivalent fashions: slowly decaying variance, long-range dependence, non-degenerate autocorrelations, heavy-tailed distributions.
Reference: Park, K. and Willinger, W. (2000) Self-Similar network traffic and performance evaluation, A Wiley-Interscience Publication, John Wiley & Sons, Inc.
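One common way to approximate self-similar traffic (following the ON/OFF-source construction in the literature) is to draw burst and idle periods from a heavy-tailed Pareto distribution. The shape parameter below is an illustrative assumption, not a value from the slides:

```python
# Sketch of a Pareto ON/OFF traffic source: ON periods inject one flit per
# cycle, OFF periods inject nothing; period lengths are heavy-tailed.

import random

def pareto_period(shape=1.5, scale=1.0, rng=random):
    # Inverse-CDF sampling; heavy-tailed (infinite variance) for 1 < shape < 2.
    u = 1.0 - rng.random()              # u in (0, 1], avoids division by zero
    return scale / (u ** (1.0 / shape))

def on_off_source(cycles, shape=1.5, seed=0):
    """Return a 0/1 injection trace of the given length from one source."""
    rng = random.Random(seed)
    out, state = [], 1                  # start in the ON state
    while len(out) < cycles:
        period = max(1, int(pareto_period(shape, rng=rng)))
        out.extend([state] * period)
        state ^= 1                      # alternate ON and OFF
    return out[:cycles]

trace = on_off_source(1000)
assert len(trace) == 1000 and set(trace) <= {0, 1}
```

Aggregating many such independent sources yields traffic whose burstiness persists across time scales, unlike a Poisson source.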
Traffic Parameter
Offered Load: Number of packets injected in a particular time interval.
Locality Factor: Ratio of the traffic destined to the local cluster of a core to the total traffic injected by that core. Locality factor = 0 signifies uniformly distributed traffic.
For example, in a 4x4 mesh, the destinations of a corner source S lie at distances d = 1, 2, 3, 4, 5, and 6. If the locality factor is 0.5:
- 50 percent of the traffic will go to the cluster at d = 1.
- The remaining 50 percent will be distributed as: 15% to the cluster at d = 2, 12.5% to d = 3, 10% to d = 4, 7.5% to d = 5, and 5% to d = 6.
If there is more than one core in a cluster, the traffic will be randomly distributed among them.
[Figure: 4x4 mesh with clusters at distances d = 0 to d = 6 from the corner source S]
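The locality-factor split for the 4x4-mesh example (locality factor 0.5) can be sampled with a simple cumulative draw; the per-distance shares below are the ones stated on the slide:

```python
# Picking a destination cluster (by distance d) per the slide's 50/15/12.5/
# 10/7.5/5 percent split for locality factor 0.5.

import random

SHARES = {1: 0.50, 2: 0.15, 3: 0.125, 4: 0.10, 5: 0.075, 6: 0.05}

def pick_cluster(rng):
    u, acc = rng.random(), 0.0
    for d, share in SHARES.items():
        acc += share
        if u < acc:
            return d
    return 6                            # guard against floating-point rounding

rng = random.Random(1)
counts = {}
for _ in range(10_000):
    d = pick_cluster(rng)
    counts[d] = counts.get(d, 0) + 1
assert abs(counts[1] / 10_000 - 0.5) < 0.05   # roughly half the traffic is local
```

A destination core is then chosen uniformly at random among the cores of the selected cluster.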
Performance Evaluation
Performance of any network depends on the following network parameters.
Topology. Locality factor of the traffic. Buffer position and buffer depth. Switching techniques. Number of cores attached.
Theoretically,
Throughput is proportional to (Number of Links) / (Average Distance)
Here, a wormhole router architecture is used to form the network with the following parameters: number of cores attached = 32; message length = packet length = 64 flits; each flit consists of 32 bits; total simulation = 200,000 cycles with 10,000 cycles of settling time.
Performance Evaluation
Throughput varies with topology and locality factor
Throughput = Maximum Accepted Traffic in flits/cycle/IP We kept buffer depth = 6 in both input and output channels of the router in all the cases
Performance Evaluation
Latency decreases with increase in Locality Factor in different topologies
We kept buffer depth = 6 in both input and output channels of the router in all the cases
[Figure: cross-section of interconnects showing metal layers 3, 4, and 5]
Parasitic components (R, C, L) of the three-wire model have been extracted using the field-solver tool of HSPICE. The energy consumption of the middle wire for different transitions is also obtained from HSPICE.
We kept buffer depth = 6 in both input and output channels of the router in all the cases
FIFO_Depth_4-4 => Input Channel FIFO Depth =4, Output Channel FIFO Depth = 4 FIFO_Depth_4-6 => Input Channel FIFO Depth =4, Output Channel FIFO Depth = 6 FIFO_Depth_6-6 => Input Channel FIFO Depth =6, Output Channel FIFO Depth = 6 FIFO_Depth_4-0 => Input Channel FIFO Depth =4, No FIFO at Output Channel FIFO_Depth_6-0 => Input Channel FIFO Depth =6, No FIFO at Output Channel
Internal Power
Netlist view of a D-type flip-flop with synchronous clear input in Synopsys Design Vision
Internal power = short-circuit power + internal node switching power. The output node of the clock buffer switches continuously with a free-running clock. To minimize internal power: stop the clock when the network is idle.
Total Core Area = (32 * 2.5 * 2.5) sq. mm. = 200 sq. mm.
Scalability Measurement
Scalability is the property of exhibiting performance proportional to the number of cores employed. As the size of a scalable system is increased, a corresponding increase in performance is obtained.
2D mesh, no VCs, XY routing
[Figure: virtual channels; a DEMUX splits the physical data link into VC buffers (VC0, VC1), and a MUX driven by the VC control/VC scheduler merges them back onto the link]
Reference: Dally, W. J. (1992) Virtual Channel Flow Control, IEEE Trans. on Parallel and Distributed Systems, Vol. 3, No. 2, pp. 194205.
Virtual Channels
VC0 VC1
Virtual Channels
VC0 VC1
X
No VCs available
X
2D mesh, 2 VCs, XY routing
[Figure: virtual-channel router; link controllers on the physical channels, DEMUXes into per-VC input buffers, and MUXes feeding the crossbar, with a MUX per output port]
Reference: N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen, A Virtual Channel Router for On-Chip Networks, in Proc. of IEEE Intl SOC Conference. IEEE Computer Society Press, pp. 289293, 2004.
- Up to 4 virtual channels, throughput increases; beyond that it saturates.
- Energy dissipation increases with the number of virtual channels.
- For an energy-performance trade-off, 4 virtual channels per physical channel is preferred.
Reference: Pande, P. P., Grecu, C., Jones, M., Ivanov, A. and Saleh, R. (2005) Performance evaluation and design trade-offs for MP-SOC interconnect architectures, IEEE Trans. on Computers, Vol. 54, No. 8, pp.10251040.
Reference: Pande, P. P., Grecu, C., Jones, M., Ivanov, A. and Saleh, R. (2005) Performance evaluation and design trade-offs for MP-SOC interconnect architectures, IEEE Trans. on Computers, Vol. 54, No. 8, pp.10251040.
% SoC Area Overhead:
Mesh: 3.701 (without VC), 6.145 (with VC)
BFT: 2.424 (without VC), 3.507 (with VC)
Reference: Rijpkema, E., Goossens, K., Radulescu, A., Dielssen, J., Meerbergen, J. V., Wielage, P., and Waterlander, E. (2003) Trade-offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip, IEE Proc. Computers and Digital Techniques, Vol. 150, No. 5, pp. 294-302.
Lecture 3
Application Mapping
The weight of edge fi,j, denoted by bwi,j, represents the available bandwidth across the edge.
Map Function
map: V -> U
Each edge k of the core graph represents a commodity dk.
Each commodity has a value vl(dk) representing the bandwidth requirement of the communication from vi to vj.
Bandwidth constraint: An edge in the topology graph must have enough bandwidth to accommodate all commodities passing through it.
Minimize communication cost: Sum over k of vl(dk) * dist(source(dk), dest(dk))
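The communication-cost objective is straightforward to evaluate for a candidate mapping. In this sketch the core names, tile coordinates, and the choice of Manhattan distance for dist() are illustrative assumptions:

```python
# Evaluating the mapping cost: sum of vl(dk) * dist(source(dk), dest(dk)).

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def comm_cost(commodities, mapping):
    """commodities: list of (src_core, dst_core, bandwidth vl(dk)).
    mapping: core name -> tile coordinate in the topology graph."""
    return sum(vl * manhattan(mapping[s], mapping[d])
               for (s, d, vl) in commodities)

# Hypothetical example: two commodities on a small mesh.
commodities = [("c0", "c1", 100), ("c1", "c2", 50)]
mapping = {"c0": (0, 0), "c1": (0, 1), "c2": (1, 1)}
assert comm_cost(commodities, mapping) == 150
```

A mapping heuristic (NMAP, GA, PSO, etc.) searches over `mapping` assignments to minimize this cost subject to the bandwidth constraint.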
Mapping Solution
Mapping Algorithms
The mapping problem is intractable. Several approaches are possible: ILP; heuristics (PMAP, GMAP, PBB, NMAP, BMAP, etc.); meta-search heuristics (GA, PSO, simulated annealing). Other variants of the problem additionally combine task scheduling, power consumption, alternative routing paths, etc.
Merging: An Example
2. Router Selection:
Sharing a single buffer among low-bandwidth input channels. The choice of router is made from a library.
3. Unfolding:
Add additional routers and links for larger bandwidth requirements
System configuration
// In this topology: 8 cores, 8 memories, 4x4 torus
// ----------------------------- IP cores
// name, switch number, clock divider, buffers, type
core(core_0, switch_0, 1, 6, initiator);
core(mem_8, switch_11, 1, 6, target:0x00);
[]
// ----------------------------- switches
// name, input ports, output ports, buffers
switch(switch_0, 5, 5, 6);
switch(switch_1, 5, 5, 6);
[]
// ----------------------------- links
// name, source, destination
link(link0, switch_0, switch_1);
link(link1, switch_1, switch_0);
[]
// ----------------------------- routes
// source, destination, hops
route(core_0, pm_8, switches:0,1,5,6,7,11);
route(core_1, pm_9, switches:1,5,9,8);
route(core_2, pm_10, switches:2,6,5,9);
route(core_3, pm_11, switches:3,2,6,10);
[]
Specifies: NIs (I/Os, clocks, buffers), switches (I/Os, buffers), links, routes.
MPARM Architecture
Reference: Bertozzi, D. and Benini, L. (2004) xpipes: A Network on-Chip Architecture for Giga Scale Systems-on-Chips, IEEE Circuits and Systems Magazine, pp. 18-31.
Reference: Carloni, L. P., Pande, P. P. and Xie, Y. (2009) 'Networks-on-Chip in emerging interconnect paradigms: Advantages and challenges', ACM/IEEE Int'l Symp. on Networks-on-Chip, pp. 93-102.
Aethereal
Bibliography
For detailed, updated references, the audience is directed to the following link:
http://www.cl.cam.ac.uk/~rdm34/onChipNetBib/onChipNetwork.pdf
Below we list some of our contributions in NoC research:
[1] S. Kundu and S. Chattopadhyay, 'Interfacing Cores and Routers in Network-on-Chip Using GALS', IEEE International Symposium on Integrated Circuits (ISIC), 2007.
[2] S. Kundu and S. Chattopadhyay, 'Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture', ACM Great Lakes Symposium on VLSI (GLSVLSI), 2008.
[3] S. Kundu, R. P. Dasari, K. Manna, and S. Chattopadhyay, 'Mesh-of-Tree based Scalable Network-on-Chip Architecture', IEEE Region 10 Colloquium and International Conference on Industrial and Information Systems (ICIIS), 2008.
[4] S. Kundu and S. Chattopadhyay, 'Mesh-of-Tree based Network-on-Chip Architecture Using Virtual Channel based Router', IEEE VLSI Design and Test Conference (VDAT), 2008.
[5] S. Kundu and S. Chattopadhyay, 'Network-on-chip architecture design based on mesh-of-tree deterministic routing topology', International Journal of High Performance Systems Architecture, Vol. 1, No. 3, pp. 163-182, Inderscience, 2008.
[6] S. Kundu, R. P. Dasari, K. Manna, and S. Chattopadhyay, 'Performance Evaluation of Mesh-of-Tree Based Network-on-Chip Using Wormhole Router with Poisson Distributed Traffic', IEEE VLSI Design and Test Conference (VDAT), 2009.
[7] S. Kundu, K. Manna, S. Gupta, K. Kumar, R. Parikh, and S. Chattopadhyay, 'A Comparative Performance Evaluation of Network-on-Chip Architectures Under Self-Similar Traffic', IEEE International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom), 2009.
Thank You