(Tutorial) NoC: The Next Generation of Multi-Processor SoC
Network-on-Chip: The Next Generation of Multi-Processor System-on-Chip
Presenters
Dept. of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur.
18th Feb, 2011
Lecture 1
Introduction
[Figure: end nodes, each consisting of a device with SW and HW interfaces, connected by links to a shared communication medium]
After mass-market production of dual-core and quad-core processor chips, the trend towards multi-core processing is now well established. In multi-core processing, multiple processors (e.g., CPUs, DSPs) along with multiple computer components (e.g., microcontrollers, memory blocks, timers) are integrated onto a single silicon chip. This architecture is often called a Multi-Processor System-on-Chip (MPSoC).
Introduction
System-on-Chip (SoC)
Each on-chip component is referred to as an Intellectual Property (IP) block. The communication medium used in modern multi-processor chips is bus based. Up to tens of cores in a single chip, the performance of these bus-based chips is satisfactory, but beyond that performance degrades with the number of cores attached.
Reference: International Technology Roadmap for Semiconductors (ITRS) Documents (2003), Available at: http://public.itrs.net/Files/2003ITRS/Home2003.htm.
Segmented Bus
The shared global bus is segmented by inserting repeaters (R). In a segmented bus, delay increases linearly with decreasing process technology. There is no improvement in bandwidth, as it is still shared by all the cores attached to it. At the system level, this has a profound effect in changing the focus from computation to communication.
Advantage: Bandwidth is higher than the shared bus. Drawbacks: Switch size increases with the number of cores. The number of links needed increases exponentially as the number of cores increases. More metal layers are required in placement and routing.
Crossbar switch and point-to-point links. Advantage: A crossbar switch enhances scalability to some extent. Drawback: However, connecting a large number of cores with a single switch is not very effective, as it is not ultimately scalable; thus, it is an intermediate solution.
o Off-chip networks have higher latency than their on-chip counterparts. o Area is not a strong constraint for off-chip networks, but for on-chip networks it is one of the major constraints.
Reference: Benini, L. and De Micheli, G. (2002) 'Networks on chips: a new SoC paradigm', IEEE Computer, Vol. 35, No. 1, pp. 70-78.
NoC
Aggregate bandwidth grows with system size. Speed is unaffected by the number of cores N. Distributed arbitration. Separate abstraction layers. However: a more complex architecture.
NoC
SoC
High throughput. Low latency. Scalable architecture. Less energy consumption. Smaller area requirements. Reliability in communication. Quality-of-Service support.
Switching Techniques
Circuit Switching
Buffers for request tokens
Request for circuit establishment (routing and arbitration are performed during this step)
Circuit Switching
Buffers for ack tokens
Acknowledgment and circuit establishment (as the token travels back to the source, connections are established)
Circuit Switching
Source end node. Request for circuit establishment. Acknowledgment and circuit establishment. Message transport (neither routing nor arbitration is required).
Circuit Switching
X
Source end node. Acknowledgment and circuit establishment. Packet transport. High contention, low utilization, low throughput. Destination end node.
Switching Techniques
Store-and-forward Packet switching
Packets are completely stored before any portion is forwarded Buffers for data packets
Store
Switching Techniques
Store-and-forward Packet switching
Packets are completely stored before any portion is forwarded Latency per router depends on the size of the packet
Requirement: buffers must be sized to hold entire packet
Store Forward
Drawbacks: 1. Larger buffers. 2. More latency.
Switching Techniques
Virtual Cut-Through Packet Switching
Packets completely stored at the switch
Requirement: buffers must be sized to hold entire packet
Drawback:
Busy Link
Latency per router is reduced by forwarding the header flit of a packet as soon as there is space for the entire packet in the next router.
Switching Techniques
Wormhole Packet Switching
Advantage: Lower buffer space, lower latency. Drawback: Throughput lower than virtual cut-through. Requirement: packets can be larger than buffers.
Busy Link
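The contrast among the three switching techniques can be made concrete with first-order, zero-load latency models. This is an illustrative sketch only (assuming H hops, a packet of L flits, and one flit per cycle per channel), not the cycle-accurate simulator used later in the tutorial:

```python
# First-order, contention-free latency models (in cycles) for the switching
# techniques above; hops = H, packet_flits = L, one flit moves per cycle.

def saf_latency(hops, packet_flits):
    # Store-and-forward: each router receives the whole packet before forwarding.
    return hops * packet_flits

def cut_through_latency(hops, packet_flits):
    # Virtual cut-through / wormhole (no contention): the header pipelines
    # through the routers while the body streams behind it.
    return hops + packet_flits

# A 64-flit packet over 4 hops: 256 cycles store-and-forward vs 68 cut-through.
assert saf_latency(4, 64) == 256
assert cut_through_latency(4, 64) == 68
```

Under contention the techniques diverge further: wormhole blocks in place with small buffers, while virtual cut-through absorbs whole packets at intermediate routers.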
Flitization
Header flit (32-bit): bop, eop, GT/BE, Src_add, Dest_add
Payload flits 1 … n (32-bit each): bop, eop, GT/BE, DATA 1 … DATA n
Tail flit (32-bit): bop, eop, GT/BE
Deflitization: reassembly of the (64 x 32)-bit packet
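As a rough sketch of flitization and deflitization, the following splits a packet into flits tagged with begin-of-packet (bop) and end-of-packet (eop) flags. The GT/BE bit and the address fields of the slide's header format are omitted here for brevity, so this is a simplification under stated assumptions rather than the exact encoding:

```python
# Hypothetical flitization sketch: each flit is a (bop, eop, data) tuple.

def flitize(payload_words):
    """Split a packet (list of 32-bit words) into flits with bop/eop flags."""
    flits = []
    n = len(payload_words)
    for i, word in enumerate(payload_words):
        bop = 1 if i == 0 else 0          # set on the header flit only
        eop = 1 if i == n - 1 else 0      # set on the tail flit only
        flits.append((bop, eop, word))
    return flits

def deflitize(flits):
    """Reassemble the packet payload; assumes flits arrive in order."""
    assert flits[0][0] == 1 and flits[-1][1] == 1
    return [data for (_, _, data) in flits]

packet = list(range(64))                  # a 64-word packet, as on the slide
assert deflitize(flitize(packet)) == packet
```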
Topology Selection
Diameter: The maximum shortest-path distance between any two nodes in the network. Networks with small diameters are preferable.
Number of Links: A topology with a large number of links can support high bandwidth.
Average Distance: The average of the distances between all pairs of nodes of the graph. A topology with a smaller average distance is preferable.
Bisection Width: The minimum number of wires that must be removed in order to bisect the network. A larger bisection width enables faster information exchange and is preferable.
Node Degree: The number of channels connecting a node to its neighbors. The lower this number, the easier it is to build the network.
Reference: Interconnection Network Architectures (2001) pp.2649, Available at: www.wellesley.edu/cs/ courses/cs331/notes/notesnetworks.pdf
All switches are connected to the four closest switches and to the target resource block via two opposite unidirectional links, except those switches on the edge of the layout.
For an M x N mesh: Diameter: (M + N - 2). Bisection Width: min(M, N). Number of routers required: (M x N). Node Degree: 3 (corner), 4 (edge), 5 (central). CLICHE: Chip-Level Integration of Communicating Heterogeneous Elements.
Reference: Kumar, S., Jantsch, A., Soininen, J. P., Forsell, M., Millberg, M., Oberg, J., Tiensyrja, K. and Hemani, A. (2002) A network on chip architecture and design methodology, Proc. of. ISVLSI, pp.117124.
Wires are wrapped around from the top component to the bottom and from the rightmost to the leftmost. For an M x N torus:
Diameter: M/2 + N/2. Bisection Width: 2 x min(M, N). Number of routers required: (M x N). Node Degree: 5.
Disadvantage: the long end-around connections can yield excessive delays.
Reference: Dally, W. J. and Towles, B. (2001) Route packets, not wires: on-chip interconnection networks, Proceedings of the 38th Design Automation Conference (DAC 2001), pp.684689.
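The mesh and torus formulas quoted above are easy to sanity-check in code. This small sketch simply transcribes them (using floor division for the torus diameter, an assumption consistent with even M and N):

```python
# Closed-form topology metrics for M x N mesh and torus, as on the slides.

def mesh_metrics(m, n):
    return {
        "diameter": m + n - 2,
        "bisection_width": min(m, n),
        "routers": m * n,
    }

def torus_metrics(m, n):
    return {
        "diameter": m // 2 + n // 2,   # wrap-around links halve each dimension
        "bisection_width": 2 * min(m, n),
        "routers": m * n,
    }

assert mesh_metrics(4, 4) == {"diameter": 6, "bisection_width": 4, "routers": 16}
assert torus_metrics(4, 4) == {"diameter": 4, "bisection_width": 8, "routers": 16}
```

For the same node count, the torus halves the diameter and doubles the bisection width at the cost of the long wrap-around wires noted above.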
Reference: Dally, W.J. and Seitz, C.L. (1986) The torus routing chip, Journal of Distributed Computing, Vol. 1, No. 4, pp.187196.
2D Octagon of 8 cores
Reference: Karim, F., Nguyen, A. and Dey, S. (2002) An interconnect architecture for networking systems on chips, IEEE Micro, Vol. 22, No. 5, pp.3645.
A binary tree-based network with N (a power of 2) IP cores has: Diameter: log2 N. Bisection Width: 1. Number of routers required: (N/2 - 1). Node Degree: 5 (leaf), 3 (stem), 2 (root).
Reference: Jeang, Y. L., Huang, W. H. and Fang, W. F. (2004) A binary tree architecture for application specific network on chip (ASNOC) design, IEEE Asia-Pacific Conference on Circuits and Systems, pp.877880.
Every level has the same number of switches. The functional IP blocks reside at the leaves and the switches at the vertices. For N IP blocks, the network has: Diameter: log2 N/4. Bisection Width: N/2. Number of routers required: (N log2 N)/8. Node Degree: 8 (non-root node), 4 (root node).
In the network, the IPs are placed at the leaves and the switches at the vertices. For N IPs, the network has:
Diameter: log2 N/4. Bisection Width: N. Number of routers needed: (N/2). Node Degree: 6 (non-root), 4 (root).
Advantages: requires a smaller number of switches; low diameter and large bisection width.
Drawback: high node degree.
Reference: Pande, P. P., Grecu, C., Ivanov, A. and Saleh, R. (2003), High-throughput switch-based interconnect for future SoCs, Proc. Intl Workshop on System-on-Chip for Real Time Applications, pp.304310.
Mesh-of-Tree Topology
- In an M x N MoT, M denotes the number of row trees and N the number of column trees; both M and N are powers of 2.
- Number of nodes = 3*M*N - (M + N).
- Small diameter: (2 log2 M + 2 log2 N).
- Large bisection width: min(M, N).
Drawback - Non-planar topology.
Reference: Kundu, S. and Chattopadhyay, S. (2008), Mesh-of-Tree Deterministic Routing for Networkon-Chip Architecture, ACM Great Lake Symposium on VLSI, pp. 343346.
Routing
Source routing: The routing control unit in the switches is simplified, since the route is computed at the source. Headers carrying the route tend to be larger, which increases overhead.
Distributed routing: The next hop is computed at each switch by a finite-state machine or by a look-up table.
Deterministic routing: Always follows a specified path. Easy to implement and supports in-order delivery.
Adaptive routing: Takes different paths based on congestion and faults, which destroys in-order delivery. Uses historical channel-load information, length of queues, and status of nodes and links.
Routing Challenges
Live-lock in Adaptive Routing
Livelock: Arises from an unbounded number of allowed non-minimal hops. Solution: restrict the number of non-minimal hops allowed.
Deadlock: Arises from a set of packets being blocked, each waiting only for network resources (i.e., links, buffers) held by other packets in the set. Its probability increases with increased traffic and decreased resource availability.
Deterministic Routing
Adaptive Routing
North-Last
Negative-First
Reference: Glass, C. J. and Ni, L. M. (1992), Turn Model for Adaptive Routing, Proceedings of International Symposium on Computer Architecture, pp. 278 287.
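The turn models above restrict an adaptive router; the deterministic baseline they are usually measured against is XY (dimension-order) routing. A minimal sketch, assuming integer tile coordinates on a 2D mesh:

```python
# Deterministic XY routing: travel fully in the X dimension, then in Y.
# Forbidding Y-to-X turns in this way removes all dependency cycles.

def xy_route(src, dst):
    """Return the hop-by-hop output directions from src=(x, y) to dst."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:
        step = 1 if dx > x else -1
        x += step
        hops.append("E" if step == 1 else "W")
    while y != dy:
        step = 1 if dy > y else -1
        y += step
        hops.append("N" if step == 1 else "S")
    return hops

assert xy_route((0, 0), (2, 1)) == ["E", "E", "N"]
```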
Rule 1. A packet is not allowed to take an EN turn or an ES turn at any node located in an even column.
Rule 2. A packet is not allowed to take an NW turn or an SW turn at any node located in an odd column.
Reference: Chiu, G. M. (2000), The Odd-Even Turn Model for Adaptive Routing, IEEE Transactions on Parallel and Distributed Systems, pp. 729 738.
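The two Odd-Even rules can be expressed as a simple turn-legality check. This is only a sketch of the rule test itself, not a full Odd-Even router:

```python
# Odd-Even turn model: legality of a turn depends on the column parity of the
# node where the turn is taken (Rule 1 for even columns, Rule 2 for odd).

def turn_allowed(turn, column):
    even = (column % 2 == 0)
    if turn in ("EN", "ES") and even:
        return False        # Rule 1: no EN/ES turns in even columns
    if turn in ("NW", "SW") and not even:
        return False        # Rule 2: no NW/SW turns in odd columns
    return True

assert not turn_allowed("EN", 2)   # even column: EN forbidden
assert turn_allowed("EN", 3)       # odd column: EN allowed
assert not turn_allowed("NW", 3)   # odd column: NW forbidden
```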
[Figure: four-node example n0-n3 with a cyclic channel dependency (labels: n0n2, n1n3, n2n0, n3n1), removed by splitting each physical channel into virtual channels]
Reference: Dally, W. J. and Seitz, C. L., (1987) Deadlock Free Message Routing in Multiprocessor Interconnection Networks, IEEE Transactions on Computers, vol. C-36, no. 5, pp. 547 553.
Deadlock Recovery
Allow deadlock to occur, but once a potential deadlock situation is detected, break at least one of the cyclic dependencies to recover gracefully. The common techniques are:
Regressive recovery (abort-and-retry): Remove packet(s) from a dependency cycle by killing (aborting) and later reinjecting (retrying) the packet(s) into the network after some delay.
Progressive recovery (preemptive): Remove packet(s) from a dependency cycle by rerouting the packet(s) onto a deadlock-free lane.
Reference: Kundu, S. and Chattopadhyay, S. (2007) Interfacing Cores and Routers in Network-on-Chip Using GALS, IEEE International Symposium on Integrated Circuits (ISIC 2007), pp.
Reference: Yi, Cheng, Gray code sequences, U. S. Patent 6703950, March 9, 2004.
Reference: Cummings, C. E. and Alfke, P. (2002) Simulation and Synthesis Techniques for Asynchronous FIFO Design with Asynchronous Pointer Comparisons, Synopsys Users Group Conference, vol. User Papers.
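Gray-coded pointers are what make the cross-domain pointer comparison safe: adjacent counts differ in exactly one bit, so a pointer sampled mid-transition in the other clock domain is off by at most one, never wildly wrong. A sketch of the standard conversions:

```python
# Binary <-> Gray conversions used for asynchronous-FIFO read/write pointers.

def bin_to_gray(n):
    return n ^ (n >> 1)

def gray_to_bin(g):
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Property check: successive Gray codes have Hamming distance exactly 1.
for i in range(15):
    diff = bin_to_gray(i) ^ bin_to_gray(i + 1)
    assert diff != 0 and (diff & (diff - 1)) == 0   # exactly one bit set

assert gray_to_bin(bin_to_gray(41)) == 41
```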
Full = ( (waddr == raddr) && (wr_dir != rd_dir) ) Empty = ( (waddr == raddr) && (wr_dir == rd_dir) )
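The full/empty conditions above can be modeled with an extra "direction" (wrap) bit carried alongside each pointer: equal addresses mean full if the pointers have wrapped a different number of times, empty otherwise. A behavioral sketch (the DEPTH value is an illustrative assumption):

```python
# Behavioral model of the FIFO full/empty test above: pointers wrap modulo
# DEPTH and the direction bit toggles on each wrap.

DEPTH = 8

class FifoPointers:
    def __init__(self):
        self.waddr = self.raddr = 0
        self.wr_dir = self.rd_dir = 0

    def push(self):
        assert not self.full()
        self.waddr += 1
        if self.waddr == DEPTH:
            self.waddr, self.wr_dir = 0, self.wr_dir ^ 1

    def pop(self):
        assert not self.empty()
        self.raddr += 1
        if self.raddr == DEPTH:
            self.raddr, self.rd_dir = 0, self.rd_dir ^ 1

    def full(self):
        return self.waddr == self.raddr and self.wr_dir != self.rd_dir

    def empty(self):
        return self.waddr == self.raddr and self.wr_dir == self.rd_dir

f = FifoPointers()
assert f.empty() and not f.full()
for _ in range(DEPTH):
    f.push()
assert f.full()
```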
Metastability
The Full and Empty signals are controlled by both clocks; thus there is a probability of metastable states arising. Two-stage synchronizers are used to reduce the probability of metastability. The Full signal is synchronized with wr_clk, and the Empty signal is synchronized with rd_clk.
Arbitration
Router Architecture
Reference: Kundu, S. and Chattopadhyay, S. (2008) Network-on-chip architecture design based on Mesh-of-Tree deterministic routing topology, Intl Journal of High Performance Systems Architecture, Vol. 1, No. 3, pp. 163-182.
[Figure: router datapath; physical channels feed input buffers (IB), a crossbar (ST) driven by crossbar control, and output buffers (OB); the critical path runs through the input-buffering stage]
[Figure: router architecture; link controllers on the physical channels feed input buffers (IB); a route-computation unit (RC, implementing the routing algorithm) and a switch-arbitration unit (SA, selecting the output port) drive the crossbar control; packets traverse the crossbar (ST) to the output buffers (OB). Pipeline stages per head flit: IB, RC, SA, ST, OB; subsequent flits repeat IB, ST, OB.]
Performance Evaluation
Performance Metrics
Throughput: TP = (Maximum Accepted Packets x Packet Length) / (Number of IP blocks x Total Time). Unit: flits/cycle/IP.
Latency: The time (in clock cycles) that elapses between the injection of a message header into the network at the source node and the reception of the tail flit at the destination node. Averaged over P received packets: Lavg = (Sum_i Li) / P.
Bandwidth: Bandwidth refers to the maximum number of bits that can be sent successfully to the destination through the network per second. It is expressed in bps (bits/sec).
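The two formulas above translate directly from simulation counts; a minimal sketch:

```python
# Computing the performance metrics defined above from simulation counts.

def throughput(accepted_packets, packet_len_flits, num_ips, total_cycles):
    """TP in flits/cycle/IP."""
    return (accepted_packets * packet_len_flits) / (num_ips * total_cycles)

def average_latency(per_packet_latencies):
    """Lavg: mean of per-packet latencies L_i over the P received packets."""
    return sum(per_packet_latencies) / len(per_packet_latencies)

# E.g., 1000 accepted 64-flit packets on 32 IPs over 200,000 cycles.
assert abs(throughput(1000, 64, 32, 200_000) - 0.01) < 1e-12
assert average_latency([60, 70, 80]) == 70
```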
Cost Metrics
Energy dissipation: Energy consumed by routers and links at different workloads. Average energy/packet and average energy/clock cycle are measured.
Area requirements: The percentage of chip area occupied by the switches and links is taken into consideration.
To calculate performance metrics like throughput and latency, the delay of each and every gate is not required. In that case, a cycle-accurate simulator is the best choice.
Drawbacks
Limited to mesh topology. No power evaluation. Not freely available. Packet-level transactions.
components of the router. SystemC is normally preferred. Traffic generators are used for evaluating the performance of the NoC.
[Figure: router components; input channel (input buffer, routing computation unit, control unit) and output channel (output buffer, arbiter, control unit)]
Traffic Generation: Poisson distribution, self-similar traffic, application-specific traffic. Network metrics: 1. Throughput, 2. Latency, 3. Bandwidth.
Traffic Generator
Application-driven traffic is best suited for performance evaluation. Due to its unavailability, synthetic traffic source models are also used. The nature of traffic in a NoC is generally bursty.
A Poisson process
When observed on a fine time scale, it will appear bursty. The burst length of a Poisson arrival process tends to be smoothed by averaging over a long enough time scale. Poisson processes thus fail to capture the actual burstiness of NoC traffic: they exhibit only short-range dependence.
Reference: Varatkar, G.V. and Marculescu, R. (2004) On-chip traffic modeling and synthesis for MPEG-2 video applications, IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 12, No. 1, pp. 108-119.
Traffic Generator
A Self-Similar (fractal) process
When aggregated over a wide range of time scales, it maintains its bursty characteristic. Self-similarity manifests itself in several equivalent fashions: slowly decaying variance, long-range dependence, non-degenerate autocorrelations, heavy-tailed distributions.
Reference: Park, K. and Willinger, W. (2000) Self-Similar network traffic and performance evaluation, A Wiley-Interscience Publication, John Wiley & Sons, Inc.
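One common way to approximate self-similar traffic (following the ON/OFF-source construction in the literature) is to draw burst and idle periods from a heavy-tailed Pareto distribution. The shape parameter below is an illustrative assumption, not a value from the slides:

```python
# Sketch of a Pareto ON/OFF traffic source: ON periods inject one flit per
# cycle, OFF periods inject nothing; period lengths are heavy-tailed.

import random

def pareto_period(shape=1.5, scale=1.0, rng=random):
    # Inverse-CDF sampling; heavy-tailed (infinite variance) for 1 < shape < 2.
    u = 1.0 - rng.random()              # u in (0, 1], avoids division by zero
    return scale / (u ** (1.0 / shape))

def on_off_source(cycles, shape=1.5, seed=0):
    """Return a 0/1 injection trace of the given length from one source."""
    rng = random.Random(seed)
    out, state = [], 1                  # start in the ON state
    while len(out) < cycles:
        period = max(1, int(pareto_period(shape, rng=rng)))
        out.extend([state] * period)
        state ^= 1                      # alternate ON and OFF
    return out[:cycles]

trace = on_off_source(1000)
assert len(trace) == 1000 and set(trace) <= {0, 1}
```

Aggregating many such independent sources yields traffic whose burstiness persists across time scales, unlike a Poisson source.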
Traffic Parameter
Offered Load: Number of packets injected in a particular time interval.
Locality Factor: Ratio of the traffic destined to the local cluster of a core to the total traffic injected by that core. Locality factor = 0 signifies uniformly distributed traffic.
For example, in a 4x4 mesh, the destinations of a corner source S lie at distances d = 1, 2, 3, 4, 5, and 6. If the locality factor is 0.5:
- 50 percent of the traffic will go to the cluster at d = 1.
- The remaining 50 percent will be distributed as: 15% to the cluster at d = 2, 12.5% to d = 3, 10% to d = 4, 7.5% to d = 5, and 5% to d = 6.
If there is more than one core in a cluster, the traffic will be randomly distributed among them.
[Figure: 4x4 mesh with clusters at distances d = 0 to d = 6 from the corner source S]
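The locality-factor split for the 4x4-mesh example (locality factor 0.5) can be sampled with a simple cumulative draw; the per-distance shares below are the ones stated on the slide:

```python
# Picking a destination cluster (by distance d) per the slide's 50/15/12.5/
# 10/7.5/5 percent split for locality factor 0.5.

import random

SHARES = {1: 0.50, 2: 0.15, 3: 0.125, 4: 0.10, 5: 0.075, 6: 0.05}

def pick_cluster(rng):
    u, acc = rng.random(), 0.0
    for d, share in SHARES.items():
        acc += share
        if u < acc:
            return d
    return 6                            # guard against floating-point rounding

rng = random.Random(1)
counts = {}
for _ in range(10_000):
    d = pick_cluster(rng)
    counts[d] = counts.get(d, 0) + 1
assert abs(counts[1] / 10_000 - 0.5) < 0.05   # roughly half the traffic is local
```

A destination core is then chosen uniformly at random among the cores of the selected cluster.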
Performance Evaluation
Performance of any network depends on the following network parameters.
Topology. Locality factor of the traffic. Buffer position and buffer depth. Switching techniques. Number of cores attached.
Theoretically,
Throughput is proportional to (Number of Links) / (Average Distance)
Here, a wormhole router architecture is used to form the network with the following parameters: number of cores attached = 32; message length = packet length = 64 flits; each flit consists of 32 bits; total simulation = 200,000 cycles with 10,000 cycles of settling time.
Performance Evaluation
Throughput varies with topology and locality factor
Throughput = Maximum Accepted Traffic in flits/cycle/IP We kept buffer depth = 6 in both input and output channels of the router in all the cases
Performance Evaluation
Latency decreases with increase in Locality Factor in different topologies
We kept buffer depth = 6 in both input and output channels of the router in all the cases
[Figure: cross-section of interconnects showing metal layers 3, 4, and 5]
Parasitic components (R, C, L) of the three-wire model have been extracted using the field-solver tool of HSPICE. The energy consumption of the middle wire for different transitions is also obtained from HSPICE.
We kept buffer depth = 6 in both input and output channels of the router in all the cases
FIFO_Depth_4-4 => Input Channel FIFO Depth =4, Output Channel FIFO Depth = 4 FIFO_Depth_4-6 => Input Channel FIFO Depth =4, Output Channel FIFO Depth = 6 FIFO_Depth_6-6 => Input Channel FIFO Depth =6, Output Channel FIFO Depth = 6 FIFO_Depth_4-0 => Input Channel FIFO Depth =4, No FIFO at Output Channel FIFO_Depth_6-0 => Input Channel FIFO Depth =6, No FIFO at Output Channel
Internal Power
Netlist view of a D-type flip-flop with synchronous clear input in Synopsys Design Vision
Internal power = short-circuit power + internal node switching power. The output node of the clock buffer switches continuously with a free-running clock. To minimize internal power: stop the clock when the network is idle.
Total Core Area = (32 * 2.5 * 2.5) sq. mm. = 200 sq. mm.
Scalability Measurement
Scalability is the property of exhibiting performance proportional to the number of cores employed. As the size of a scalable system is increased, a corresponding increase in performance is obtained.
2D mesh, no VCs, XY routing
[Figure: virtual channels; a DEMUX splits the physical data link into VC buffers (VC0, VC1), and a MUX driven by the VC control/VC scheduler merges them back onto the link]
Reference: Dally, W. J. (1992) Virtual Channel Flow Control, IEEE Trans. on Parallel and Distributed Systems, Vol. 3, No. 2, pp. 194205.
Virtual Channels
VC0 VC1
Virtual Channels
VC0 VC1
X
No VCs available
X
2D mesh, 2 VCs, XY routing
[Figure: virtual-channel router; link controllers on the physical channels, DEMUXes into per-VC input buffers, and MUXes feeding the crossbar, with a MUX per output port]
Reference: N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen, A Virtual Channel Router for On-Chip Networks, in Proc. of IEEE Intl SOC Conference. IEEE Computer Society Press, pp. 289293, 2004.
- Up to 4 virtual channels, throughput increases; beyond that it saturates.
- Energy dissipation increases with the number of virtual channels.
- For an energy-performance trade-off, 4 virtual channels per physical channel is preferred.
Reference: Pande, P. P., Grecu, C., Jones, M., Ivanov, A. and Saleh, R. (2005) Performance evaluation and design trade-offs for MP-SOC interconnect architectures, IEEE Trans. on Computers, Vol. 54, No. 8, pp.10251040.
Reference: Pande, P. P., Grecu, C., Jones, M., Ivanov, A. and Saleh, R. (2005) Performance evaluation and design trade-offs for MP-SOC interconnect architectures, IEEE Trans. on Computers, Vol. 54, No. 8, pp.10251040.
% SoC Area Overhead:
Mesh: 3.701 (without VC), 6.145 (with VC)
BFT: 2.424 (without VC), 3.507 (with VC)
Reference: Rijpkema, E., Goossens, K., Radulescu, A., Dielssen, J., Meerbergen, J. V., Wielage, P., and Waterlander, E. (2003) Trade-offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip, IEE Proc. Computers and Digital Techniques, Vol. 150, No. 5, pp. 294-302.
Lecture 3
Application Mapping
The weight of edge fi,j, denoted by bwi,j, represents the available bandwidth across the edge.
Map Function
map: V -> U
Each edge k of the core graph represents a commodity dk.
Each commodity has a value vl(dk) representing the bandwidth requirement of the communication from vi to vj.
Bandwidth constraint: An edge in the topology graph must have enough bandwidth to accommodate all commodities passing through it.
Minimize communication cost: Sum over k of vl(dk) * dist(source(dk), dest(dk))
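The communication-cost objective is straightforward to evaluate for a candidate mapping. In this sketch the core names, tile coordinates, and the choice of Manhattan distance for dist() are illustrative assumptions:

```python
# Evaluating the mapping cost: sum of vl(dk) * dist(source(dk), dest(dk)).

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def comm_cost(commodities, mapping):
    """commodities: list of (src_core, dst_core, bandwidth vl(dk)).
    mapping: core name -> tile coordinate in the topology graph."""
    return sum(vl * manhattan(mapping[s], mapping[d])
               for (s, d, vl) in commodities)

# Hypothetical example: two commodities on a small mesh.
commodities = [("c0", "c1", 100), ("c1", "c2", 50)]
mapping = {"c0": (0, 0), "c1": (0, 1), "c2": (1, 1)}
assert comm_cost(commodities, mapping) == 150
```

A mapping heuristic (NMAP, GA, PSO, etc.) searches over `mapping` assignments to minimize this cost subject to the bandwidth constraint.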
Mapping Solution
Mapping Algorithms
The mapping problem is intractable. Several approaches are possible: ILP; heuristics (PMAP, GMAP, PBB, NMAP, BMAP, etc.); meta-search heuristics (GA, PSO, simulated annealing). Other variants of the problem additionally combine task scheduling, power consumption, alternative routing paths, etc.
Merging: An Example
2. Router Selection:
Sharing a single buffer among low-bandwidth input channels. The choice of router is made from a library.
3. Unfolding:
Add additional routers and links for larger bandwidth requirements
System configuration
// In this topology: 8 cores, 8 memories, 4x4 torus
// ----------------------------- IP cores
// name, switch number, clock divider, buffers, type
core(core_0, switch_0, 1, 6, initiator);
core(mem_8, switch_11, 1, 6, target:0x00);
[]
// ----------------------------- switches
// name, input ports, output ports, buffers
switch(switch_0, 5, 5, 6);
switch(switch_1, 5, 5, 6);
[]
// ----------------------------- links
// name, source, destination
link(link0, switch_0, switch_1);
link(link1, switch_1, switch_0);
[]
// ----------------------------- routes
// source, destination, hops
route(core_0, pm_8, switches:0,1,5,6,7,11);
route(core_1, pm_9, switches:1,5,9,8);
route(core_2, pm_10, switches:2,6,5,9);
route(core_3, pm_11, switches:3,2,6,10);
[]
Specifies: NIs (I/Os, clocks, buffers), switches (I/Os, buffers), links, routes.
MPARM Architecture
Reference: Bertozzi, D. and Benini, L. (2004) xpipes: A Network on-Chip Architecture for Giga Scale Systems-on-Chips, IEEE Circuits and Systems Magazine, pp. 18-31.
Reference: Carloni, L. P., Pande, P. P. and Xie, Y. (2009) 'Networks-on-Chip in emerging interconnect paradigms: Advantages and challenges', ACM/IEEE Int'l Symp. on Networks-on-Chip, pp. 93-102.
Aethereal
Bibliography
For detailed, updated references, the audience is directed to the following link:
http://www.cl.cam.ac.uk/~rdm34/onChipNetBib/onChipNetwork.pdf
Below we list some of our contributions in NoC research:
[1] S. Kundu and S. Chattopadhyay, 'Interfacing Cores and Routers in Network-on-Chip Using GALS', IEEE International Symposium on Integrated Circuits (ISIC), 2007.
[2] S. Kundu and S. Chattopadhyay, 'Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture', ACM Great Lakes Symposium on VLSI (GLSVLSI), 2008.
[3] S. Kundu, R. P. Dasari, K. Manna, and S. Chattopadhyay, 'Mesh-of-Tree based Scalable Network-on-Chip Architecture', IEEE Region 10 Colloquium and International Conference on Industrial and Information Systems (ICIIS), 2008.
[4] S. Kundu and S. Chattopadhyay, 'Mesh-of-Tree based Network-on-Chip Architecture Using Virtual Channel based Router', IEEE VLSI Design and Test Conference (VDAT), 2008.
[5] S. Kundu and S. Chattopadhyay, 'Network-on-chip architecture design based on mesh-of-tree deterministic routing topology', International Journal of High Performance Systems Architecture, Vol. 1, No. 3, pp. 163-182, Inderscience, 2008.
[6] S. Kundu, R. P. Dasari, K. Manna, and S. Chattopadhyay, 'Performance Evaluation of Mesh-of-Tree Based Network-on-Chip Using Wormhole Router with Poisson Distributed Traffic', IEEE VLSI Design and Test Conference (VDAT), 2009.
[7] S. Kundu, K. Manna, S. Gupta, K. Kumar, R. Parikh, and S. Chattopadhyay, 'A Comparative Performance Evaluation of Network-on-Chip Architectures Under Self-Similar Traffic', IEEE International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom), 2009.
Thank You