Multi-Application Multi-Step Mapping Method for Many-Core Network-on-Chips

Conference Paper · December 2010 · DOI: 10.1109/NORCHIP.2010.5669454 · Source: IEEE Xplore · https://www.researchgate.net/publication/224205744

Bo Yang∗, Liang Guang∗‡, Thomas Canhao Xu∗‡, Alexander Wei Yin∗‡, Tero Säntti∗†, Juha Plosila∗†
∗ Department of Information Technology, University of Turku, Finland
† Academy of Finland, Research Council for Natural Sciences and Engineering
‡ Turku Center for Computer Science, Turku, Finland
{boyan, liagua, canxu, yinwei, teansa, juplos}@utu.fi

Abstract: Massively parallel computing performed on many-core Network-on-Chips (NoCs) is the future of computing. One feasible approach to implementing parallel computing is to deploy multiple applications on the NoC simultaneously. In this paper, we propose a multi-application mapping method that starts with application mapping, which finds a region on the NoC for each application, and continues with task mapping, which maps all tasks of the application into that region. In the application mapping step, several strategies based on the maximal empty rectangle (MER) technique are introduced for finding an optimal region for each application.
In the task mapping step, a tree-model based algorithm is used with the purpose of reducing the communication latency and energy consumption. The experimental results show that the proposed method can achieve a considerable reduction of network latency and energy consumption (up to 18%) for a given set of applications.

I. INTRODUCTION

Over the last 40 years, we have witnessed a series of remarkable developments in the computer industry. One of them is the increasing processing capability of the system. The increase is achieved not only by the performance improvements between generations of uniprocessors, but also by the advent of multi-core or many-core architectures where tens to hundreds of processors or cores can be integrated on a single chip. Examples of such architectures are [6] and [17]. A recent study at the University of California, Berkeley [1] suggests that it will soon be possible to integrate more than 1000 cores on a single chip, since Moore's Law is still delivering twice as many transistors every couple of years.

As the number of on-chip cores increases, the communication among them becomes critical to system performance and energy consumption. In the last decade, the NoC has been proposed as an alternative to the traditional bus and point-to-point ad hoc connections in order to address the challenge of increasing concurrent communication requirements as well as the difficulty of global synchronization [4].

Based on NoC platforms, a large body of research addressing the mapping problem has been undertaken in the last couple of years [9] [12] [10] [13]. In [9], Hu et al. presented a branch and bound algorithm which maps the tasks of a single application to nodes and generates a suitable deadlock-free routing function such that the total communication energy consumption is minimized under specified performance constraints.
Several well-used task mapping algorithms in the literature are analyzed and compared in [12]. Tang et al. proposed a two-step genetic algorithm and the related software for mapping concurrent applications on a fixed NoC architecture [10]. Murali et al. presented a methodology to map multiple use-cases onto the NoC architecture, satisfying the constraints of each use-case [13]. In these works, multiple applications reuse the same platform in different time slots. The main drawback of these systems is the timing overhead incurred by reconfiguring the NoC and loading new applications. Also, since various communication constraints and traffic characteristics of applications have to be satisfied using limited processing elements (PEs), the system design is more complicated and the optimized mapping for each application may not be achieved [13].

While the traditional approaches to maximizing the serial performance of processors, namely maximizing the clock speed and increasing instruction-level parallelism (ILP), have been shown to reach their limits [7] [8], many-core NoC architectures offer a more feasible way to deliver higher performance through parallel computing. Massively parallel computing performed on many-core NoCs is the future of computing [1]. With the increasing number and computational power of on-chip PEs, parallel computing on many-core NoCs can be realized not only at the instruction level, but also at the higher task and/or application levels. To realize this higher-level parallelism and make full use of the abundant resources on the NoC, it is no longer reasonable to focus only on the implementation of a single application when abundant PEs are available. Instead, the design focus should shift from single-application to multi-application scenarios. More precisely, multiple applications could be deployed on different regions of the NoC and executed in parallel.

978-1-4244-8973-2/10/$26.00 © 2010 IEEE
In this paper, we propose a novel mapping method whereby multiple applications can be simultaneously mapped on many-core NoCs. The mapping method consists of application mapping and task mapping. The two-step mapping method first finds a region on the NoC for each application and then maps all tasks of the application into the region. Several strategies based on the MER technique are introduced for finding an objective MER for each application. Following the application mapping, a tree-model based algorithm is used to map all tasks of the application into the objective MER. By optimizing the layout of both the applications and the tasks within each application, the proposed method aims at achieving lower network latency and energy consumption for multiple applications on many-core NoCs.

II. PROBLEM FORMULATION

A. System Model

The target system is shown in Figure 1, consisting of a Real-time Operating System (RTOS) and a NoC platform. The NoC provides the computation and communication resources to implement multiple applications. The RTOS schedules the given set of applications (e.g. A_1 to A_6 in Figure 1) and manages the resources on the NoC. The mapper runs the proposed mapping algorithm to map each application on a feasible region, and the loader loads all tasks on PEs according to the mapping solution. This work deals with on-line scenarios, i.e., the RTOS does not know in advance when each application arrives and how many PEs it needs. In this paper, we focus on the mapping algorithm of the mapper.

Fig. 1: System Model

B. Problem Description

In single-application mapping scenarios, the mapping problem is how to find an appropriate position for each task of the application subject to particular performance or cost metrics. In multi-application scenarios, the problem is extended to searching for the optimal positions for both the applications and the tasks of each individual application.
We first give the definitions regarding the target application and NoC architecture used in this paper.

Definition 1: We assume that each application has already been implemented as a set of tasks. The application is modeled by a task graph (TG). A TG is a directed graph TG = <T, C>, where T = {t_1, t_2, ..., t_p} represents the set of tasks, corresponding to the set of TG vertices, and C = {(t_i, t_j, w_ij)} denotes the set of communications between tasks, corresponding to the set of TG edges. The edge weight w_ij in (t_i, t_j, w_ij) represents the total amount of data sent from t_i to t_j. The number of tasks p in the TG is denoted as the size of the given application.

Definition 2: A NoC is modeled as a communication resource graph (CRG). A CRG is a directed graph CRG = <N, L>, where N = {n_1, n_2, ..., n_q} denotes the set of nodes on the NoC, corresponding to the set of CRG vertices, and L = {(n_i, n_j, |l_ij|)} designates the set of routing paths from node n_i to node n_j, corresponding to the edges of the CRG. |l_ij| represents the communication length from node n_i to node n_j. The number of nodes q in the CRG is denoted as the size of the NoC.

For the sake of simplicity, in this paper the NoC is assumed to be a homogeneous 2-D mesh using the deterministic X-Y routing strategy. Using these definitions, the problem of multi-application mapping can be described as follows: Given a set of TGs and a CRG, find a mapping area (MA) on the CRG for each TG which can accommodate all tasks of the TG, and find a position within the MA for each task, such that the lowest overall network delay and communication energy consumption are achieved for the given set of TGs.

C. Objective Formulation

Since the network delay is proportional to the communication distance between the source and destination nodes on the NoC, one feasible way to reduce network delay is to shorten the communication distance among tasks as much as possible.
This can be achieved in the process of finding the optimal MA for an application. We use the nodes average distance (NAD) mentioned in [10] to evaluate the average communication distance within the MA. NAD is defined as the average distance between two randomly selected nodes in the NoC architecture. For an X × Y mesh NoC, the NAD is:

\[ NAD = \frac{X + Y}{3} \times \left(1 - \frac{1}{X \times Y}\right) \tag{1} \]

Equation (1) implies that, for a given application, the average communication distance among tasks varies when different areas are used to map the tasks of the application. The more compact the area is, the smaller the NAD it achieves.

The energy consumption of a communication between tasks t_i and t_j is determined by both the communication weight w_ij and the distance |l_ij|. To reduce the communication energy consumption, minimizing the weighted communication of the application (WCA) has been proved to be efficient [18]. The WCA is defined as the sum of the products of w_ij and |l_ij| over all communications in an application:

\[ WCA = \sum_{\forall i,j} w_{ij} \times |l_{ij}| \tag{2} \]

Based on these formulations, the objectives of the proposed method are transformed into seeking the most compact mapping area MA with the smallest NAD and the optimized task mapping solution with the minimized WCA.

III. MULTI-APPLICATION MULTI-STEP MAPPING

To reach the two goals mentioned in the previous section, we propose a two-step multi-application mapping method. The mapping consists of two sequential phases: application mapping (AM) and task mapping (TM). AM deals with the mapping of multiple applications; its purpose is to optimize the layout of the applications mapped on the NoC and find the optimal MA with the minimal NAD for each application. TM works after AM to conduct the task mapping of an individual application and achieve the minimized WCA.

A. Application Mapping (AM)

On a 2-D mesh NoC, any sub-mesh or rectangle can be regarded as a piece of compact area.
Thus, the problem of AM is turned into the problem of managing the rectangles on the NoC. To do this, AM adopts the concept of the maximal empty rectangle (MER), which was originally used to solve the placement problem in FPGA design [2].

1) MER Technique: A MER is an empty rectangle that is not contained in any other empty rectangle. In our case, a MER represents a cluster of free nodes on the NoC that can be used to map an application. Figure 2 shows an example of application mapping using the MER technique. At first, the whole surface of the NoC is represented by one MER R_0 (Figure 2a). After the mapping of application A_1, R_0 is split into R_1 and R_2 (Figure 2b). In Figure 2c, R_1 is further fragmented into R_3 and R_4 after the application A_2 has been mapped. The MERs R_2, R_3 and R_4 can be used for future application mapping. Let w(R) and h(R) be the width and height of the MER R. The normalized aspect ratio A(R) of the MER R is defined as:

\[ A(R) = \frac{\max\{w(R), h(R)\}}{\min\{w(R), h(R)\}} \tag{3} \]

The aspect ratio A(R) indicates the shape of the MER. If it equals 1, the MER is a square. Otherwise, it is a non-square rectangle.

Fig. 2: Application Mapping Using MER, panels (a), (b), (c)

2) Objective MER Selection: For a given application with size p, AM tries to find an optimal or near-optimal objective MER R_m to map the application. Based on the state of the MERs on the NoC, the cases that AM may face are: (1) the total number of PEs in all MERs is not adequate to accommodate the given application; (2) there is at least one candidate MER that can accommodate the given application; (3) the total number of PEs in all MERs is adequate to accommodate the given application, but none of the MERs can fit the application alone. In the first case, the mapping request is rejected at this time and the RTOS can try the mapping later. For the second case, we propose the following strategies for finding the objective MER.
• Best Size (BS): BS chooses the candidate MER with the smallest size as the objective MER R_m. Intuitively, this strategy tries to keep the big rectangles for future application mapping.
• Best Shape (BSh): It is noteworthy in Equation (1) that an area with equal width X and height Y holds the minimal NAD among all areas with size X × Y. Taking this into consideration, the BSh strategy chooses the candidate MER with the minimal A(R) as the objective MER R_m. The reason behind BSh is that in such a MER, the application is more likely to be mapped in an area close to a square, so that a smaller NAD can be achieved.
• Best Size Best Shape (BSBSh): BSBSh is extended from BS. If there are several candidate MERs with the same smallest size, the one with the minimal A(R) is selected.
• Best Shape Best Size (BShBS): Similar to the previous one, among all candidate MERs with the same minimal A(R), the one with the smallest size is selected.

Whenever an objective MER R_m is selected, AM chooses a mapping area MA with minimal A(MA) in R_m to map the given application. In this paper, we define the corner of the objective MER R_m which is closest to any corner of the NoC as the starting point to create the MA. The reasons behind this are to reduce fragmentation along the borders of the NoC and to reduce congestion in the middle area of the NoC by leaving free MERs there. The created area MA is returned as an input for the TM phase.

To deal with the third case, the LS+C strategy is applied.

• Largest Size + Combining (LS+C): In this case, the application has to be mapped on separate MERs. To avoid increasing the communication cost between distant small MERs, LS+C chooses the free MER with the largest number of PEs as the primary area and then combines the nearest free MERs to obtain adequate PEs for the application. The combined mapping area MA is returned as an input for the TM phase.
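The four selection strategies for the second case can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names Rect, nad and select_mer are hypothetical, a MER is reduced to its width and height, and a MER is assumed to be a candidate when its node count is at least the application size p. Ties are broken by list order, which the paper does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A free rectangle (MER), reduced to width and height in nodes."""
    w: int
    h: int

    @property
    def size(self) -> int:
        return self.w * self.h

    @property
    def aspect(self) -> float:
        # Normalized aspect ratio A(R), Equation (3); 1.0 means a square.
        return max(self.w, self.h) / min(self.w, self.h)


def nad(x: int, y: int) -> float:
    """Nodes average distance of an x-by-y mesh, Equation (1)."""
    return (x + y) / 3 * (1 - 1 / (x * y))


def select_mer(mers, p, strategy):
    """Pick the objective MER for an application of size p, or None.

    Assumption: a MER is a candidate when it holds at least p nodes.
    None means no single MER fits (cases 1 and 3 in the text).
    """
    candidates = [r for r in mers if r.size >= p]
    if not candidates:
        return None
    sort_key = {
        "BS": lambda r: r.size,                # smallest size
        "BSh": lambda r: r.aspect,             # closest to a square
        "BSBSh": lambda r: (r.size, r.aspect),  # size first, then shape
        "BShBS": lambda r: (r.aspect, r.size),  # shape first, then size
    }[strategy]
    return min(candidates, key=sort_key)


mers = [Rect(8, 2), Rect(5, 4), Rect(3, 3)]
print(select_mer(mers, 16, "BS"))   # Rect(w=8, h=2)
print(select_mer(mers, 16, "BSh"))  # Rect(w=5, h=4)
print(nad(4, 4), nad(8, 2))         # the square 16-node area has the lower NAD
```

In this example BS keeps the larger 5 × 4 rectangle free but accepts an elongated 8 × 2 area, while BSh prefers the near-square 5 × 4 rectangle, which matches the trade-off described in the bullets above.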
3) MER Merging: When the execution of an application completes, the area occupied by the application can be released and merged with neighboring free MERs to obtain larger MERs for future mappings.

Combining these techniques and strategies, the AM algorithm is described in Algorithm 1.

Algorithm 1: Multi-Application Mapping
Input: TGs: a set of applications; CRG: a 2-D mesh with size W × H
Output: The mapping areas for the applications in TGs
1 Initiate the original MERs list R_0 with size W × H.
2 if the free PEs on the NoC cannot accommodate the arriving application A_i then
3   Reject the mapping request.
4 else if at least one MER can accommodate A_i then
5   Use the appropriate strategy to select one objective MER and create the mapping area MA.
6 else
7   Use the LS+C strategy to find a mapping area MA.
8 if application A_j is completed then
9   Merge the area occupied by A_j with neighboring free MERs;
10 Repeat 2-9 until the MA for each application is found.

Figure 3 is an example of application mapping using Algorithm 1. Four applications with sizes 25, 16, 16 and 9, used in the experiment in Section IV and denoted as FFT(25), X264(16), TPCH(16) and FFT(9) respectively, are mapped sequentially on a NoC with size 10 × 7. Figures 3a and 3b show the final mapping under the BS and BSh strategies respectively. The main difference between these two mapping results is the transposed locations of the applications X264(16) and TPCH(16). Under both strategies, the LS+C strategy is used for the application FFT(9).

Fig. 3: Application Mapping Using BS and BSh. (a) BS Mapping; (b) BSh Mapping

The major responsibility of the AM algorithm is to manage the MERs list. As mentioned in [2], the algorithm for managing MERs is O(n^2) for n mapped applications.

B. Weighted NAD

In Algorithm 1, the MERs which cannot accommodate the given application are not selected as the objective MER R_m as long as there are candidate MERs, although some of them have a smaller A(R) than the selected R_m and can accommodate most tasks of the application. Figure 4a is an example of application mapping under the BSh strategy. After the applications FFT(25) and X264(16) have been mapped, the candidate MER R_1 is selected (shown in Figure 3b), although the non-candidate MER R_2 has a better shape and a close size (15) for the application TPCH(16). This is because the combination of several separated MERs is likely to induce a higher NAD and WCA than a monolithic MER. However, if the task mapping algorithm presented in the following section is taken into account, it is reasonable to accept a non-candidate MER on which most of the tasks can be mapped as the objective MER. Since the task mapping always chooses the task which affects the WCA most and maps it prior to other tasks, the last selected tasks have limited impact on the overall WCA even if they are mapped on separate MERs. Therefore, we propose another strategy for the objective MER selection, termed weighted NAD (WNAD). The WNAD of a MER is defined as follows:

\[ WNAD = \frac{N_{tasks}}{N_{nodes}} \times NAD \tag{4} \]

where the first factor is the weighted ratio, N_tasks is the number of tasks in the application, and N_nodes is the number of nodes occupied by the tasks if the application is mapped on the MER. For a candidate MER, the weighted ratio equals 1 and the WNAD strategy is equivalent to the BSh strategy. A MER with a lower WNAD can accommodate more tasks with a smaller NAD. Using the WNAD strategy, both the candidate and the non-candidate MERs of the previous strategies can be evaluated together to find the objective MER. Figure 4b is an example of using the WNAD strategy to map the same set of applications as in Figure 3.

Fig. 4: Application Mapping Using WNAD. (a) BSh Mapping for TPCH(16); (b) WNAD Mapping

C. Task Mapping (TM)

After the mapping area MA for a given application has been obtained in the AM phase, the role of TM is to map the tasks of the application with the purpose of minimizing the WCA of the application. To address the task mapping, we propose a tree-model based mapping algorithm. The mapping algorithm consists of two parts: the abstraction of a mapping area MA into an extended tree structure, and the mapping of an application onto the extended tree. Figure 5 is an example of mapping the tasks of an application A_2 (shown in Figure 2c) on the selected MA.

Fig. 5: Tree-Model Based Task Mapping

1) Tree Model of MA: The abstraction of a MA into an extended tree structure follows Algorithm 2. Simply put, the center point of the MA, which has the shortest average distance to the other nodes in the MA, is chosen as the root node of the tree. The neighbors of the center point become the child nodes of the root node. The procedure continues until all nodes in the MA are put onto the tree (bottom right of Figure 5). The structure is called an extended tree since some children may have more than one parent node. This extended tree structure places the network nodes with a shorter average distance (to other nodes) onto higher-level tree nodes. Intuitively, a task with a large communication volume should be placed at as high a level of the tree as possible, in order to minimize the total communication cost, which is proportional to the average communication distance.

Algorithm 2: Tree Abstraction Algorithm
Input: mapping area MA
Output: An extended tree abstraction
1 Select the center network node as the root node in the tree;
2 Traverse the NoC from the center node, record all its neighbors as the child nodes;
3 Repeat 2 for each child node until all nodes are in the tree.

2) TM Algorithm: The mapping of applications onto the tree follows Algorithm 3. We calculate the communication volume (CV, Definition 3) of each task in the task graph, and place the task with the largest communication volume onto the root node in the tree. Then we calculate the weighted communication volumes of the remaining tasks to the ones already mapped onto the tree, termed the affinity to partial tree (APT, Definition 4), and place the task with the largest APT on the highest node available in the tree. This procedure iterates until all tasks have been mapped onto the tree.

Definition 3: Let t_i be a task in the task graph TG, and w_ij be the communication volume from t_i to t_j. Then

\[ CV_{t_i} = \sum_{\forall t_j \in T} (w_{ij} + w_{ji}) \]

CV_{t_i} is the communication volume of t_i.

Definition 4: Let T' be the set of mapped tasks on a tree, and t_i be a task not yet mapped. Then

\[ APT_{t_i} = \sum_{\forall t_j \in T'} (w_{ij} + w_{ji}) \]

APT_{t_i} is the affinity of t_i to the partial (mapped) tree.

Algorithm 3: Task Mapping Algorithm
Input: TG, Abstracted Tree of MA
Output: Task Mapping on the Tree
1 Calculate CV for all tasks, and map the task with the largest CV onto the root node;
2 Calculate APT of all non-mapped tasks, and map the task with the largest APT to the highest level tree node available;
3 Repeat 2 until all tasks have been mapped.

The tree-model based mapping has low complexity and high efficiency. For instance, the tree-based mapping has an algorithm complexity of O(N), where N is the number of tasks in the TG, while the greedy incremental (GI) algorithm presented in [12] has an algorithm complexity of O(N^2). By mapping tasks starting from the root of the tree, the algorithm minimizes the WCA using the APT method and consequently reduces the energy consumption and network delay.

IV. EXPERIMENT

A. Experiment Setup

Full system simulations were performed to evaluate the proposed method under different mapping strategies. Since the comparison in [12] shows that the GI algorithm achieves good results compared with some other algorithms, the GI algorithm was chosen as a reference to evaluate the tree-model based algorithm used in task mapping. The tree-model based and GI algorithms were used together with the BS, BSh and WNAD strategies of application mapping to compare the performance of these strategies. Four benchmark applications were selected. Three of them are from the SPLASH-2 [15] and PARSEC [5] suites: FFT with 25 and 9 cores (FFT(25) and FFT(9) respectively) and X264 with 16 cores (X264(16)). The fourth, TPCH with 16 cores (TPCH(16)), is an ad hoc decision support benchmark from TPC [16]. The mapping was conducted in the order FFT(25), X264(16), TPCH(16) and FFT(9).

B. Results Analysis

A cycle-accurate NoC simulator, Noxim [14], was extended and used to simulate the traffic of the four applications on a 10 × 7 NoC and to produce the network delay and communication energy consumption under different mapping strategies. The workload traces of these four applications were gathered from Simics [11], where the NoC was configured to model a chip multiprocessor (CMP). Each PE has a core, a private L1 cache and a shared L2 cache bank. Memory controllers are connected to the top and bottom sides of the chip. The static non-uniform cache architecture (NUCA) [3] is implemented in our memory/cache architecture, in which data are mapped to cache banks statically.

The average communication distance (ACD), WCA, average network latency (ANL) and energy consumption (EC) under the different strategies were compared. The ACD is the average communication distance among all tasks when the application is mapped on the selected objective MER. The ANL is the average number of cycles needed for transferring one packet on the NoC.

Fig. 6: ACD Using Different Strategies (y-axis: average communication distance in hops; x-axis: FFT(25), X264(16), TPCH(16), FFT(9); series: BS+GI, BS+Tree, BSh+GI, BSh+Tree, WNAD+GI, WNAD+Tree)

Fig. 7: WCA Using Different Strategies (y-axis: normalized weighted communication of application in percent; x-axis: FFT(25), X264(16), TPCH(16), FFT(9); series: BS+GI, BS+Tree, BSh+GI, BSh+Tree, WNAD+GI, WNAD+Tree)

Figure 6 shows the ACD for each application under different strategies.
BS+GI and BS+Tree represent the cases where BS is applied to the application mapping and the GI or the tree-model based algorithm, respectively, to the task mapping, and so forth. The varying ACDs of X264, TPCH and FFT(9) under different strategies show the impact of the objective MER on the ACD. For each of them, the optimal ACD is achieved when the application is mapped on an objective MER with minimal aspect ratio A(R), or intuitively, a rectangle close to a square. Non-optimized mappings result in the highest ACD for X264 (3.14, 25% higher than the optimal 2.49), TPCH (3.14, 25% higher than the optimal 2.49) and FFT(9) (2.49, 23% higher than the optimal 2.03). For most applications, WNAD obtains the same or a better solution than the BS and BSh strategies. The only exception is the case of TPCH, where a primary MER combined with another MER is selected for the mapping under the WNAD strategy, instead of a monolithic MER under the BS and BSh strategies. This shows the negative impact of separate MERs on the ACD. Also note that the task mapping algorithm has negligible impact on the ACD.

The normalized WCA of each application under different strategies is displayed in Figure 7. The impact of the application mapping on the WCA is consistent with that on the ACD (shown in Figure 6). However, it is noteworthy that the task mapping has a great impact on the WCA. In all cases, the tree-model based algorithm outperforms the GI algorithm. For example, using the WNAD strategy, the tree-model based algorithm achieves 35%, 17%, 16% and 32% lower WCA than the GI algorithm for the four applications. Furthermore, WNAD+Tree yields the lowest WCA for each application among all strategies.

Fig. 8: ANL Using Different Strategies (y-axis: normalized average network latency in percent; x-axis: BS, BSh, WNAD; series: GI, Tree-Model Based)

Fig. 9: EC Using Different Strategies (y-axis: normalized energy consumption in percent; x-axis: BS, BSh, WNAD; series: GI, Tree-Model Based)

The normalized simulation results of the ANL and the EC are shown in Figures 8 and 9. As anticipated by the WCA in Figure 7, the tree-model based algorithm achieves lower ANL and EC than the GI algorithm. The ANL of the tree-model based algorithm is 12%, 15% and 13% lower than that of GI under the BS, BSh and WNAD strategies respectively. The same holds for the EC. Furthermore, the WNAD strategy outperforms BS and BSh and achieves the lowest ANL and EC (about 5% lower on average). For this set of applications, the difference between BS and BSh is negligible with respect to the ANL and EC. The lowest ANL and EC are achieved by WNAD+Tree, which is 18% lower than the worst case under the BSh+GI strategy.

V. CONCLUSION

An innovative method for mapping multiple applications on future many-core NoCs is proposed. The two-step mapping method first finds a region on the NoC for a given application and then maps all tasks of the application into the region. Several strategies based on the MER technique, e.g. BS, BSh and WNAD, are introduced for the application mapping. By using these strategies, the algorithm can efficiently find the optimal objective MER to map the target application. Following the application mapping, a tree-model based algorithm is proposed for the task mapping and compared against an existing GI algorithm. The experiment shows that, in the common case, the MER with minimal aspect ratio is ideal for mapping a given application. Among the proposed strategies for application mapping, WNAD is likely to obtain a better solution than BS and BSh. For the task mapping, the proposed tree-model based algorithm outperforms the GI algorithm in achieving lower network latency and energy consumption. The WNAD+Tree strategy achieves the lowest network latency and energy consumption among all strategies.

VI. ACKNOWLEDGEMENT

The authors would like to thank the Academy of Finland for the financial support for this work.
REFERENCES

[1] Krste Asanovic, Ras Bodik, Bryan C. Catanzaro, Joseph J. Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William L. Plishker, John Shalf, Samuel W. Williams, and Katherine A. Yelick. The landscape of parallel computing research: a view from Berkeley. (UCB/EECS-2006-183), December 2006.
[2] K. Bazargan, R. Kastner, and M. Sarrafzadeh. Fast template placement for reconfigurable computing systems. IEEE Design & Test of Computers, 17(1):68–83, Jan.-Mar. 2000.
[3] Bradford M. Beckmann and David A. Wood. Managing wire delay in large chip-multiprocessor caches. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, pages 319–330, December 2004.
[4] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. Computer, 35(1):70–78, Jan. 2002.
[5] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 72–81, October 2008.
[6] M. Denneau and H. S. Warren, Jr. 64-bit Cyclops: Principles of operation. IBM Tech-report, 2005.
[7] P. P. Gelsinger. Microprocessors for the new millennium: Challenges, opportunities, and new frontiers. In Proceedings of the International Solid State Circuits Conference (ISSCC), pages 22–25, 2001.
[8] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, 4th Edition. Morgan Kaufmann, 2007.
[9] Jingcao Hu and Radu Marculescu. Energy- and performance-aware mapping for regular NoC architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(4):551–562, 2005.
[10] Tang Lei and Shashi Kumar. A two-step genetic algorithm for mapping task graphs to a network on chip architecture. In DSD '03: Proceedings of the Euromicro Symposium on Digital Systems Design, page 180, Washington, DC, USA, 2003. IEEE Computer Society.
[11] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, February 2002.
[12] C. A. M. Marcon, E. I. Moreno, N. L. V. Calazans, and F. G. Moraes. Evaluation of algorithms for low energy mapping onto NoCs. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS 2007), pages 389–392, 2007.
[13] Srinivasan Murali, Martijn Coenen, Andrei Radulescu, Kees Goossens, and Giovanni De Micheli. Mapping and configuration methods for multi-use-case networks on chips. In ASP-DAC '06: Proceedings of the 2006 Asia and South Pacific Design Automation Conference, pages 146–151, Piscataway, NJ, USA, 2006. IEEE Press.
[14] University of Catania. Noxim. http://www.noxim.org/.
[15] Jaswinder Pal Singh, Anoop Gupta, Moriyoshi Ohara, Evan Torrie, and Steven Cameron Woo. The SPLASH-2 programs: Characterization and methodological considerations. Computer Architecture, International Symposium on, 0:24, 1995.
[16] TPC. TPC-H. http://www.tpc.org/tpch/.
[17] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. An 80-tile sub-100-W teraflops processor in 65-nm CMOS. Solid-State Circuits, IEEE Journal of, 43(1):29–41, 2008.
[18] Bo Yang, Thomas Canhao Xu, Tero Santti, and Juha Plosila. Tree-model based mapping for energy-efficient and low-latency network-on-chip. In Design and Diagnostics of Electronic Circuits and Systems (DDECS), pages 189–192, 14-16 2010.