See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/224205744
Multi-application multi-step mapping method
for many-core Network-on-Chips
Conference Paper · December 2010
DOI: 10.1109/NORCHIP.2010.5669454 · Source: IEEE Xplore
All content following this page was uploaded by Juha Plosila on 04 December 2016.
Multi-Application Multi-Step Mapping Method
for Many-Core Network-on-Chips
Bo Yang∗ , Liang Guang∗‡ , Thomas Canhao Xu∗‡ , Alexander Wei Yin∗‡ , Tero Säntti∗† , Juha Plosila∗†
∗ Department of Information Technology, University of Turku, Finland
† Academy of Finland, Research Council for Natural Sciences and Engineering
‡ Turku Center for Computer Science, Turku, Finland
{boyan, liagua, canxu, yinwei, teansa, juplos}@utu.fi
Abstract—Massive parallel computing performed on many-core Network-on-Chips (NoCs) is the future of computing.
One feasible approach to implement parallel computing is to
deploy multiple applications on the NoC simultaneously. In this
paper, we propose a multi-application mapping method that starts
with application mapping, which finds a region on the NoC
for each application, followed by task mapping, which maps all
tasks of the application into that region. In the application
mapping step, several strategies based on the maximal empty
rectangle (MER) technique are introduced for finding an optimal
region for each application. In the task mapping step, a tree-model
based algorithm is used with the purpose of reducing the
communication latency and energy consumption. The experimental
results show that the proposed method can achieve considerable
reduction of network latency and energy consumption (up to
18%) for a given set of applications.
I. INTRODUCTION
Over the last 40 years, we have witnessed a series of remarkable developments in the computer industry. One of them is the
increasing processing capability of the system. The increase is
not only achieved by the performance improvements between
the generations of uniprocessors, but also comes from the
advent of multi-core or many-core architectures where tens to
hundreds of processors or cores can be integrated on a single
chip. Examples of such architectures are [6] and [17]. A recent
study at the University of California, Berkeley [1] suggests that
it will soon be possible to integrate more than 1000 cores on
a single chip since Moore's Law is still doubling the number of
available transistors every couple of years. As
the number of on-chip cores increases, the communication
among them becomes critical to the system performance and energy
consumption. In the last decade, NoC has been proposed as
an alternative to the traditional bus and point-to-point ad hoc
connections in order to address the challenge of increasing
concurrent communication requirements as well as the difficulty of global synchronization [4].
Based on the NoC platforms, a large body of research
addressing the mapping problem has been undertaken in the
last couple of years [9] [12] [10] [13]. In [9], Hu et al.
presented a branch and bound algorithm which maps the
tasks of a single application to nodes and generates a suitable
deadlock-free routing function such that the total communication energy consumption is minimized under specified
performance constraints. Several widely used task mapping algorithms
in the literature are analyzed and compared in [12]. Tang
et al. proposed a two-step genetic algorithm and the related
software for mapping concurrent applications on a fixed NoC
architecture [10]. Murali et al. presented a methodology to
map multiple use-cases onto the NoC architecture, satisfying
the constraints of each use-case [13]. In these works, multiple
applications reuse the same platforms in different time slots.
The main drawback of these systems is the timing overhead
incurred by reconfiguring the NoC and loading new applications. Also, since various communication constraints and
traffic characteristics of applications have to be satisfied using
limited processing elements (PEs), the system design is more
complicated and the optimized mapping for each application
may not be achieved [13].
978-1-4244-8973-2/10/$26.00 © 2010 IEEE
While the traditional approaches to maximizing the serial
performance of processors, namely maximizing the clock speed
and increasing instruction-level parallelism (ILP), have been shown
to reach their limits [7] [8], the many-core NoC architectures
make it more feasible to deliver higher performance through
parallel computing. The massive parallel computing performed
on many-core NoCs is the future of computing [1]. With the
increasing number and computational power of on-chip PEs,
parallel computing on many-core NoCs can be realized not
only at the instruction level, but also at the higher task and/or
application levels. To realize the higher level parallelism and
make full use of the abundant resources on the NoCs, it is
no longer reasonable to focus only on the implementation of a
single application when abundant PEs are available on the
many-core NoCs. Instead, the design focus should shift from
the single-application to the multi-application scenarios. More
precisely, multiple applications could be deployed on different
regions of the NoC and executed in parallel.
In this paper, we propose a novel mapping method whereby
multiple applications can be simultaneously mapped on the
many-core NoCs. The mapping method consists of application
mapping and task mapping. The two-step mapping method
first finds a region on the NoC for each application and then
maps all tasks of the application into the region. Several
strategies based on the MER technique are introduced for
finding an objective MER for each application. Following the
application mapping, a tree-model based algorithm is used to
map all tasks of the application into the objective MER. By
optimizing the layout of both multiple applications and tasks
within applications, the proposed method aims at achieving
lower network latency and energy consumption for multiple
applications on the many-core NoCs.
II. PROBLEM FORMULIZATION
A. System Model
The target system is shown in Figure 1, consisting of a
Real-time Operating System (RTOS) and a NoC platform. The
NoC provides the computation and communication resources
to implement multiple applications. The RTOS schedules the
given set of applications (e.g. A1 to A6 in Figure 1) and
manages the resources on the NoC. The mapper runs the proposed mapping algorithm to map each application on a feasible
region and the loader loads all tasks on PEs according to the
mapping solution. This work deals with on-line scenarios, i.e.,
the RTOS does not know in advance when each application
arrives and how many PEs it needs. In this paper, we focus
on the mapping algorithm of the mapper.
Fig. 1: System Model
B. Problem Description
In the single-application mapping scenarios, the mapping
problem is how to find an appropriate position for each task
of the application subject to particular performance or cost
metrics. In the multi-application scenarios, the problem is
extended to search for the optimal positions for both the
applications and tasks of the individual application. We first
give the definitions regarding the target application and NoC
architecture used in this paper.
Definition 1: We assume that each application has already
been implemented as a set of tasks. The application is modeled
by a task graph (TG). A TG is a directed graph $TG = \langle T, C \rangle$,
where $T = \{t_1, t_2, \ldots, t_p\}$ represents the set
of tasks, corresponding to the set of TG vertices, and
$C = \{(t_i, t_j, w_{ij})\}$ denotes the set of communications between
tasks, corresponding to the set of TG edges. The edge weight
$w_{ij}$ in $(t_i, t_j, w_{ij})$ represents the total amount of data sent from
$t_i$ to $t_j$. The number of tasks $p$ in TG is denoted as the size
of the given application.
Definition 2: A NoC is modeled as a communication resource graph (CRG). A CRG is a directed graph
$CRG = \langle N, L \rangle$, where $N = \{n_1, n_2, \ldots, n_q\}$ denotes the set of
nodes on the NoC, corresponding to the set of CRG vertices,
and $L = \{(n_i, n_j, |l_{ij}|)\}$ designates the set of routing paths
from node $n_i$ to node $n_j$, corresponding to the edges of CRG.
$|l_{ij}|$ represents the communication length from node $n_i$ to node
$n_j$. The number of nodes $q$ in CRG is denoted as the size of
the NoC. For the sake of simplicity, in this paper, the NoC is
assumed to be a homogeneous 2-D mesh using the deterministic X-Y
routing strategy.
Using these definitions, the problem of the multi-application
mapping can be described as follows:
Given a set of TGs and a CRG, find a mapping area (MA)
on the CRG for each TG which can accommodate all tasks of the
TG, and find a position within the MA for each task, such that
the lowest overall network delay and communication energy
consumption can be achieved for the given set of TGs.
C. Objective Formulization
Since the network delay is proportional to the communication distance between the source and destination nodes on
the NoC, one feasible way to reduce network delay is to
shorten the communication distance among tasks as much as
possible. This can be achieved in the process of finding the
optimal MA for an application. We use the nodes average
distance (NAD) mentioned in [10] to evaluate the average
communication distance within the MA. NAD is defined as
the average distance between two randomly selected nodes in
the NoC architecture. For an X × Y mesh NoC, the NAD is:
$$\mathit{NAD} = \frac{X + Y}{3} \times \left(1 - \frac{1}{X \times Y}\right) \tag{1}$$
Equation (1) implies that for a given application, the
average communication distance among tasks varies when
different areas are used to map the tasks of the application.
For a given number of nodes, the more compact the area is, the smaller the NAD it achieves.
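As a quick numerical check of Equation (1), the following Python sketch evaluates the closed form and compares it with a brute-force average. It rests on two assumptions that are not stated in the paper: the distance between two nodes is taken as the hop (Manhattan) distance of X-Y routing, and the two random nodes are drawn independently, so a node may be paired with itself. The function names are illustrative only.

```python
from itertools import product


def nad(x: int, y: int) -> float:
    """Closed form of Equation (1) for an X x Y mesh."""
    return (x + y) / 3 * (1 - 1 / (x * y))


def nad_brute_force(x: int, y: int) -> float:
    """Average hop distance over all ordered node pairs (self-pairs included)."""
    nodes = list(product(range(x), range(y)))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in nodes for (bx, by) in nodes)
    return total / len(nodes) ** 2


# Two areas with 16 nodes each: the square has the smaller NAD.
square = nad(4, 4)   # about 2.5 hops
strip = nad(2, 8)    # about 3.125 hops
```

Under these assumptions the closed form and the brute-force average agree, and the 4 × 4 square is more compact than the 2 × 8 strip of the same size.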
The energy consumption of a communication between tasks
ti and tj is determined by both the communication weight wij
and the distance |lij |. To reduce the communication energy
consumption, minimizing the weighted communication of the
application (WCA) has been proved to be efficient [18]. The
WCA is defined as the sum of products of the wij and |lij |
for all communications in an application as follows:
$$\mathit{WCA} = \sum_{\forall i,j} w_{ij} \times |l_{ij}| \tag{2}$$
Based on these formulizations, the objectives of the proposed method are transformed into seeking the most compact
mapping area MA with smallest NAD and the optimized task
mapping solution with minimized WCA.
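Equation (2) can be illustrated with a small Python sketch. It assumes that the communication length |l_ij| equals the hop count of deterministic X-Y routing on the mesh, i.e. the Manhattan distance between the two nodes; the task names, weights and placements below are made up for illustration.

```python
def hop_distance(a, b):
    """Hop count between two mesh nodes (x, y) under X-Y routing (assumed)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def wca(edges, placement):
    """Sum of w_ij * |l_ij| over all communications (t_i, t_j, w_ij)."""
    return sum(w * hop_distance(placement[src], placement[dst])
               for src, dst, w in edges)


# Hypothetical task graph: three tasks, three weighted communications.
edges = [("t1", "t2", 10), ("t2", "t3", 4), ("t1", "t3", 1)]

# Placement A keeps the heavy t1 -> t2 traffic on neighbouring nodes.
placement_a = {"t1": (0, 0), "t2": (0, 1), "t3": (1, 1)}
# Placement B separates t1 and t2 by two hops.
placement_b = {"t1": (0, 0), "t2": (1, 1), "t3": (0, 1)}

wca_a = wca(edges, placement_a)  # 10*1 + 4*1 + 1*2 = 16
wca_b = wca(edges, placement_b)  # 10*2 + 4*1 + 1*1 = 25
```

The two placements use the same three nodes, yet the one that keeps the heaviest communication short yields the lower WCA.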
III. MULTI-APPLICATION MULTI-STEP MAPPING
To reach the two goals mentioned in the previous section,
we propose a two-step multi-application mapping method.
The mapping consists of two sequential phases: application
mapping (AM) and task mapping (TM). AM deals with
the mapping of multiple applications and its purpose is to
optimize the layout of multiple applications mapped on the
NoC and find the optimal MA with the minimal NAD for each
application. TM works after AM to conduct the task mapping
of an individual application and achieve the minimized WCA.
A. Application Mapping (AM)
On a 2-D mesh NoC, any sub-mesh or rectangle can be
regarded as a piece of compact area. Thus, the problem of
AM is turned into the problem of managing the rectangles
on the NoC. To do this, AM adopts the concept of maximal
empty rectangle (MER), which was originally used to solve
the placement problem in FPGA design [2].
1) MER Technique: A MER is an empty rectangle that is not
contained in any other empty rectangle. In our case, a MER
represents a cluster of free nodes on the NoC that is used to
map an application. Figure 2 shows an example of application
mapping using the MER technique. At first, the whole surface
of the NoC is represented by one MER R0 (Figure 2a). After
the mapping of application A1 , the R0 is split into R1 and R2
(Figure 2b). In Figure 2c, the R1 is further fragmented into
R3 and R4 after the application A2 has been mapped. The
MERs R2 , R3 and R4 can be used for the future application
mapping. Let w(R) and h(R) be the width and height of the
MER R. The normalized aspect ratio A(R) of the MER R is
defined as:
$$A(R) = \frac{\max\{w(R), h(R)\}}{\min\{w(R), h(R)\}} \tag{3}$$
The aspect ratio A(R) implies the shape of the MER. If it
equals 1, the MER is a square. Otherwise, it is a standard
rectangle.
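The MER definition and Equation (3) can be made concrete with a brute-force Python sketch. It is meant for small meshes only and is not the efficient MER management of [2]; the occupancy-grid representation, the (x, y, w, h) rectangle encoding and all function names are assumptions made for this illustration.

```python
def is_empty(grid, x, y, w, h):
    """True if no node inside the rectangle is occupied."""
    return all(not grid[r][c]
               for r in range(y, y + h) for c in range(x, x + w))


def empty_rectangles(grid):
    """All empty rectangles (x, y, w, h) on an occupancy grid (True = occupied)."""
    height, width = len(grid), len(grid[0])
    return [(x, y, w, h)
            for y in range(height) for x in range(width)
            for h in range(1, height - y + 1)
            for w in range(1, width - x + 1)
            if is_empty(grid, x, y, w, h)]


def contains(outer, inner):
    """True if rectangle `outer` fully covers rectangle `inner`."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


def maximal_empty_rectangles(grid):
    """Empty rectangles that are not contained in any other empty rectangle."""
    rects = empty_rectangles(grid)
    return [r for r in rects
            if not any(o != r and contains(o, r) for o in rects)]


def aspect_ratio(w, h):
    """Normalized aspect ratio A(R) of Equation (3)."""
    return max(w, h) / min(w, h)


# 4 x 3 mesh with a 2 x 2 application mapped in the top-left corner.
grid = [[True, True, False, False],
        [True, True, False, False],
        [False, False, False, False]]
mers = sorted(maximal_empty_rectangles(grid))
```

In this example the free L-shaped region is covered by two overlapping MERs, a 2 × 3 rectangle on the right and a 4 × 1 rectangle at the bottom, which mirrors how mapping one application splits a MER into several.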
Fig. 2: Application Mapping Using MER (panels (a), (b) and (c))
2) Objective MER Selection: For a given application with
the size p, AM tries to find an optimal or near-optimal
objective MER Rm to map the application. Based on the state
of MERs on the NoC, the cases that AM possibly faces are:
(1) the total amount of PEs in all MERs is not adequate to
accommodate the given application; (2) there is at least one
candidate MER that can accommodate the given application;
(3) the total amount of PEs in all MERs is adequate to
accommodate the given application, but none of them can
fit the application alone.
In the first case, the mapping request will be rejected at
this time and the RTOS can try the mapping later. For the
second case, we propose the following strategies for finding
the objective MER.
• Best Size (BS): BS chooses the candidate MER with the
smallest size as the objective MER Rm . Intuitively, this
strategy tries to keep the big rectangles for the future
application mapping.
• Best Shape (BSh): It is noteworthy in Equation (1) that
an area whose width X equals its height Y has
the minimal NAD among all areas of the same size X × Y.
Taking this into consideration, the BSh strategy chooses the
candidate MER with the minimal A(R) as the objective
MER Rm. The reason behind BSh is that in such a MER,
the application is more likely to be mapped in an area close
to a square so that a smaller NAD can be achieved.
• Best Size Best Shape (BSBSh): BSBSh is extended from
BS. If there are several candidate MERs with the same
smallest size, the one with minimal A(R) is selected.
• Best Shape Best Size (BShBS): Similar to the previous
one, among all candidate MERs with the same minimal
A(R), the one with the smallest size is selected.
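The four selection strategies differ only in how candidate MERs are ranked, which the following Python sketch expresses as sort keys. MERs are reduced to hypothetical (width, height) pairs, and ties that a strategy does not resolve are broken by list order, which is an assumption of this sketch and not something the paper specifies.

```python
def size(mer):
    """Number of PEs in a MER given as (width, height)."""
    return mer[0] * mer[1]


def aspect_ratio(mer):
    """Normalized aspect ratio A(R) of Equation (3)."""
    return max(mer) / min(mer)


# Each strategy is a ranking key; the smallest key wins.
STRATEGY_KEYS = {
    "BS": lambda m: (size(m),),
    "BSh": lambda m: (aspect_ratio(m),),
    "BSBSh": lambda m: (size(m), aspect_ratio(m)),
    "BShBS": lambda m: (aspect_ratio(m), size(m)),
}


def select_objective_mer(mers, p, strategy):
    """Pick the objective MER for an application of size p, or None."""
    candidates = [m for m in mers if size(m) >= p]
    if not candidates:
        return None  # no single MER fits: cases (1) or (3) in the text
    return min(candidates, key=STRATEGY_KEYS[strategy])


# Hypothetical free MERs and an application with 16 tasks.
free_mers = [(8, 2), (5, 5), (4, 4), (3, 3)]
choices = {s: select_objective_mer(free_mers, 16, s) for s in STRATEGY_KEYS}
```

With these made-up MERs, BS settles for the 8 × 2 strip (smallest size, first in the list), BSh for the 5 × 5 square (best shape, first in the list), while BSBSh and BShBS both arrive at the 4 × 4 square.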
Whenever an objective MER Rm is selected, AM will choose
a mapping area MA with minimal A(MA) in Rm to map the
given application. In this paper, we define the corner of the
objective MER Rm which is closest to any corner of the NoC
as the starting point to create the MA. The reasons behind this
are to reduce fragmentation along the borders of the NoC
and to reduce congestion in the middle area of the
NoC by leaving free MERs there. The created area MA is
returned as an input for the TM phase.
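The choice of the starting corner can be sketched in Python as follows. The paper does not say how "closest" is measured, so the hop (Manhattan) distance is assumed here, and ties are broken by the order in which the corners are listed; the (x, y, w, h) rectangle encoding is likewise an assumption.

```python
def hop_distance(a, b):
    """Manhattan distance between two mesh nodes (assumed metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def starting_corner(mer, noc_w, noc_h):
    """Corner node of the MER (x, y, w, h) that is closest to any NoC corner."""
    x, y, w, h = mer
    mer_corners = [(x, y), (x + w - 1, y),
                   (x, y + h - 1), (x + w - 1, y + h - 1)]
    noc_corners = [(0, 0), (noc_w - 1, 0),
                   (0, noc_h - 1), (noc_w - 1, noc_h - 1)]
    return min(mer_corners,
               key=lambda c: min(hop_distance(c, n) for n in noc_corners))


# A hypothetical 5 x 4 MER touching the top-right corner of a 10 x 7 NoC.
corner = starting_corner((5, 0, 5, 4), 10, 7)
```

Here the MER shares a corner with the NoC, so the mapping area would start at node (9, 0) and grow inward, leaving the remaining free nodes toward the middle of the mesh.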
To deal with the third case, the LS+C strategy is applied.
• Largest Size + Combining (LS+C): In this case, the
application has to be mapped on separate MERs. To avoid
the increased communication cost between distant,
small MERs, LS+C chooses the free MER
with the largest number of PEs as the primary area and then
combines the nearest free MERs to get adequate PEs
for the application. The combined mapping area MA is
returned as an input for the TM phase.
3) MER Merging: When the execution of an application
completes, the area occupied by the application can be released
and merged with neighboring free MERs to get larger MERs
for the future mappings.
Combining these techniques and strategies together, the
algorithm of AM is described as Algorithm 1.
Algorithm 1: Multi-Application Mapping
Input : TGs: a set of applications, CRG: a 2-D mesh with size W × H
Output: The mapping areas for the applications in TGs
1 Initiate the original MERs list R0 with size W × H.
2 if the free PEs on the NoC cannot accommodate the arriving application Ai then
3     Reject the mapping request.
4 else if at least one MER can accommodate Ai then
5     Use the appropriate strategy to select one objective MER and create the mapping area MA.
6 else
7     Use the LS+C strategy to find a mapping area MA.
8 if application Aj is completed then
9     Merge the area occupied by Aj with neighboring free MERs;
10 Repeat 2-9 until MA for each application is found.
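The three-way decision in lines 2 to 7 of Algorithm 1 can be written as a small Python sketch. It only classifies a mapping request and leaves MER bookkeeping aside; the `Decision` names and the example numbers are hypothetical. Because MERs may overlap, the sketch takes the number of free PEs as a separate argument instead of summing the MER sizes.

```python
from enum import Enum


class Decision(Enum):
    REJECT = "reject the request"
    SELECT_MER = "select an objective MER"
    COMBINE = "apply LS+C"


def dispatch(free_pes, mer_sizes, p):
    """Classify a request for p PEs given the free PEs and the MER sizes."""
    if free_pes < p:
        return Decision.REJECT        # case (1): not enough free PEs
    if any(s >= p for s in mer_sizes):
        return Decision.SELECT_MER    # case (2): a candidate MER exists
    return Decision.COMBINE           # case (3): several MERs must be combined


# Hypothetical state: 13 free PEs spread over MERs of 8 and 6 nodes.
decision = dispatch(13, [8, 6], 9)
```

In this made-up state a 9-task application fits on the NoC but not in any single MER, so the request falls through to LS+C.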
Figure 3 is an example of the application mapping using
Algorithm 1. Four applications with size 25, 16, 16, 9 used in
the experiment in Section IV, denoted as FFT(25), X264(16),
TPCH(16) and FFT(9) respectively, are mapped sequentially
on a NoC with size 10 × 7. Figure 3a and 3b are the final
mapping under the BS and BSh strategy respectively. The
main difference of these two mapping results is the transposed
locations of application X264(16) and TPCH(16). Under both
strategies, the LS+C strategy is used for the application
FFT(9).
Fig. 3: Application Mapping Using BS and BSh: (a) BS Mapping, (b) BSh Mapping
The major responsibility of the AM algorithm is to manage
the MERs list. As mentioned in [2], the algorithm of managing
MERs is $O(n^2)$ for n mapped applications.
B. Weighted NAD
In Algorithm 1, the MERs which cannot accommodate the
given application are never selected as the objective MER
Rm as long as there are candidate MERs, although some of
them may hold a smaller A(R) than the selected Rm and can
accommodate most tasks of the application. Figure 4a is an
example of application mapping under the BSh strategy. After
the applications FFT(25) and X264(16) have been mapped, the
candidate MER R1 is selected (shown in Figure 3b), although
the non-candidate MER R2 has a better shape and a size (15)
close to that of the application TPCH(16). This is because the
combination of several separated MERs is likely to induce
higher NAD and WCA than a monolithic MER. However, if
the task mapping algorithm presented in the following section
is taken into account, it is reasonable to accept a non-candidate
MER on which most of the tasks can be mapped as the objective MER.
Since the task mapping always
chooses the task which affects the WCA most and maps it prior
to other tasks, the last selected tasks have limited impact on
the overall WCA even if they are mapped on separate MERs.
Therefore, we propose another strategy for the objective MER
selection, termed weighted NAD (WNAD). The WNAD of
a MER is defined as follows:
$$\mathit{WNAD} = \frac{N_{tasks}}{N_{nodes}} \times \mathit{NAD} \tag{4}$$
where the first factor is the weighted ratio. $N_{tasks}$ is the
number of tasks in the application. $N_{nodes}$ is the number of
nodes occupied by the tasks if the application is mapped on
the MER. For a candidate MER, the weighted ratio equals
1 and the WNAD strategy is equivalent to the BSh strategy.
The MER with a lower WNAD can accommodate more tasks
with a smaller NAD. Using the WNAD strategy, both the
candidate and non-candidate MERs presented in the previous
strategies can be evaluated together to find the objective MER.
Figure 4b is an example of using the WNAD strategy to map
the same set of applications as in Figure 3.
Fig. 4: Application Mapping Using WNAD: (a) BSh Mapping for TPCH(16), (b) WNAD Mapping
C. Task Mapping (TM)
After the mapping area MA for a given application has been
obtained in the AM phase, the role of TM is to map the
tasks of the application with the purpose of minimizing the
WCA of the application. To address the task mapping, we
propose a tree-model based mapping algorithm. The mapping
algorithm consists of two parts: the abstraction of a mapping
area MA into an extended tree structure and the mapping of
an application onto the extended tree. Figure 5 is an example
of mapping the tasks of an application A2 (shown in Figure
2c) on the selected MA.
Fig. 5: Tree-Model Based Task Mapping
1) Tree Model of MA: The abstraction of a MA into an
extended tree structure follows Algorithm 2. Simply put, the
center point of the MA is chosen as the root node of the tree,
which has the shortest average distance to other nodes in the
MA. The neighbors of the center point are put as the children
nodes of the root node. The procedure continues until all nodes
in the MA are put onto the tree (bottom right of Figure 5). The
structure is called an extended tree since some children may
have more than one parent node. This extended tree structure
places the network nodes with shorter average distance (to
other nodes) onto higher-level tree nodes. Intuitively, a task in
the application with a large communication volume should be
placed at as high a level on the tree as possible, in order to
minimize the total communication cost which is proportional
to the average communication distance.
Algorithm 2: Tree Abstraction Algorithm
Input : mapping area MA
Output: An extended tree abstraction
1 Select the center network node as the root node in the tree;
2 Traverse the NoC from the center node, record all its neighbors as the child nodes;
3 Repeat 2 for each child node until all nodes are in the tree.
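One way to read Algorithm 2 is as a breadth-first traversal of the MA that starts at its center node, so that the visiting order lists the tree nodes level by level. The Python sketch below follows that reading and then applies the CV and APT rule of the paper's task mapping algorithm (Algorithm 3, Definitions 3 and 4) on top of it. The data layout, the function names, the tie-breaking by list order and the choice of the center as the node with the smallest total hop distance are assumptions of this sketch. It also assumes a connected MA and makes no attempt to reach the complexity the paper reports.

```python
from collections import deque


def tree_order(ma_nodes):
    """MA nodes in breadth-first order from the center node (root first)."""
    nodes = set(ma_nodes)

    def total_distance(n):
        return sum(abs(n[0] - m[0]) + abs(n[1] - m[1]) for m in nodes)

    root = min(sorted(nodes), key=total_distance)
    order, seen, queue = [], {root}, deque([root])
    while queue:
        x, y = queue.popleft()
        order.append((x, y))
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in nodes and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order


def map_tasks(tasks, edges, ma_nodes):
    """Place tasks on the tree: largest CV on the root, then largest APT next."""
    volume = {}  # w_ij + w_ji for every unordered task pair
    for src, dst, w in edges:
        pair = frozenset((src, dst))
        volume[pair] = volume.get(pair, 0) + w

    def affinity(task, others):
        return sum(volume.get(frozenset((task, o)), 0)
                   for o in others if o != task)

    slots = tree_order(ma_nodes)
    if len(slots) < len(tasks):
        raise ValueError("mapping area is too small or not connected")
    unmapped = list(tasks)
    first = max(unmapped, key=lambda t: affinity(t, tasks))  # CV of each task
    placement = {first: slots[0]}
    unmapped.remove(first)
    for slot in slots[1:]:
        if not unmapped:
            break
        nxt = max(unmapped, key=lambda t: affinity(t, placement))  # APT
        placement[nxt] = slot
        unmapped.remove(nxt)
    return placement


# Hypothetical 4-task application mapped on a 3 x 3 mapping area.
area = [(x, y) for x in range(3) for y in range(3)]
edges = [("a", "b", 10), ("b", "c", 5), ("c", "d", 1), ("a", "d", 2)]
placement = map_tasks(["a", "b", "c", "d"], edges, area)
```

In this example task b has the largest communication volume (15) and lands on the center node (1, 1); a, c and d follow in decreasing order of affinity to the tasks already placed and occupy neighbors of the center.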
2) TM Algorithm: The mapping of an application onto the
tree follows Algorithm 3. We calculate the communication
volume (CV, Definition 3) of each task in the task graph,
and place the task with the largest communication volume
onto the root node in the tree. Then we calculate the weighted
communication volumes of the remaining tasks to the ones
already mapped onto the tree, termed affinity to partial tree
(APT, Definition 4), and place the task with the largest APT
on the highest node available in the tree. This procedure iterates
until all tasks have been mapped onto the tree.
Definition 3: Let $t_i$ be a task in the task graph TG, and $w_{ij}$
be the communication volume from $t_i$ to $t_j$, then
$$CV_{t_i} = \sum_{\forall t_j \in T} (w_{ij} + w_{ji})$$
$CV_{t_i}$ is the communication volume of $t_i$.
Definition 4: Let $T'$ be the set of mapped tasks on a tree,
and $t_i$ be a task not yet mapped, then
$$APT_{t_i} = \sum_{\forall t_j \in T'} (w_{ij} + w_{ji})$$
$APT_{t_i}$ is the affinity of $t_i$ to the partial (mapped) tree.
Algorithm 3: Task Mapping Algorithm
Input : TG, Abstracted Tree of MA
Output: Task Mapping on the Tree
1 Calculate CV for all tasks, and map the task with the largest CV onto the root node;
2 Calculate APT of all non-mapped tasks, and map the task with the largest APT to the highest level tree node available;
3 Repeat 2 until all tasks have been mapped.
The tree-model based mapping has low complexity and high
efficiency. For instance, compared to the greedy incremental
(GI) algorithm presented in [12], the tree-based mapping has
an algorithm complexity of $O(N)$, where N is the number
of tasks in TG, while the GI algorithm has an algorithm
complexity of $O(N^2)$. By mapping tasks starting from the root
of the tree, the algorithm minimizes the WCA using the APT
method and consequently reduces the energy consumption and
network delay.
IV. EXPERIMENT
A. Experiment Setup
Full system simulations were performed to evaluate the
proposed method under different mapping strategies. Since the
comparison in [12] shows that the GI algorithm achieves good
results compared with some other algorithms, the GI algorithm
was chosen as a reference to evaluate the tree-model based
algorithm used in task mapping. The tree-model based and GI
algorithms were used together with the BS, BSh and WNAD
strategies of application mapping to compare the performance
of these strategies. Four benchmark applications were selected.
Three of them are from the SPLASH-2 [15] and PARSEC
[5] suites: FFT with 25 and 9 cores (FFT(25) and FFT(9)
respectively) and X264 with 16 cores (X264(16)). The fourth, TPC-H
with 16 cores (TPCH(16)), is an ad-hoc decision support
benchmark from TPC [16]. The mapping was conducted in
the order FFT(25), X264(16), TPCH(16) and FFT(9).
B. Results Analysis
A cycle-accurate NoC simulator, Noxim [14], was extended
and used to simulate the traffic of the four applications on a
10 × 7 NoC and produce network delay and communication
energy consumption under different mapping strategies. The
workload traces of these four applications were gathered from
Simics [11] where the NoC was configured to model a chip
multiprocessor (CMP). Each PE has a core, a private L1
cache and a shared L2 cache bank. Memory controllers are
connected to the top and bottom sides of the chip. The static
non-uniform cache architecture (NUCA) [3] is implemented
in our memory/cache architecture, in which data are mapped
to cache banks statically.
The average communication distance (ACD), WCA, average
network latency (ANL) and energy consumption (EC) under the
different strategies were compared. The ACD is the average
communication distance among all tasks when the application
is mapped on the selected objective MER. The ANL is the
average number of cycles needed for transferring one packet
on the NoC.
Fig. 6: ACD Using Different Strategies. [Bar chart: average communication distance (ACD, in hops, axis 0 to 4) of FFT(25), X264(16), TPCH(16) and FFT(9) under BS+GI, BS+Tree, BSh+GI, BSh+Tree, WNAD+GI and WNAD+Tree.]
Fig. 7: WCA Using Different Strategies. [Bar chart: normalized weighted communication of application (WCA, axis 0% to 100%) of FFT(25), X264(16), TPCH(16) and FFT(9) under BS+GI, BS+Tree, BSh+GI, BSh+Tree, WNAD+GI and WNAD+Tree.]
Figure 6 shows the ACD for each application under different
strategies. BS+GI and BS+Tree respectively represent the
cases where BS is applied to the application mapping and the
GI and tree-model based algorithms to the task mapping, and
so forth. The varying ACDs of X264, TPCH and FFT(9)
under different strategies show the impact of the objective MER
on the ACD. For each of them, the optimal ACD is achieved
when they are mapped on an objective MER with minimal
aspect ratio A(R), or intuitively, a rectangle close to a square.
Non-optimized mappings result in the highest ACD for X264 (3.14,
25% higher than the optimal 2.49), TPCH (3.14, 25% higher
than the optimal 2.49) and FFT(9) (2.49, 23% higher than the
optimal 2.03). For most applications, WNAD obtains
the same or a better solution than the BS and BSh strategies.
The only exception is the case of TPCH, where one primary
MER combined with another MER is selected for the mapping under
the WNAD strategy, instead of a monolithic MER under the
BS and BSh strategies. This confirms the negative impact
of separate MERs on the ACD. Also note that the task mapping
algorithm has negligible impact on the ACD.
The normalized WCA of each application under different
strategies is displayed in Figure 7. The impact of the application mapping on the WCA is consistent with that on the
ACD (shown in Figure 6). However, it is noteworthy that the
task mapping has a great impact on the WCA. In all cases,
the tree-model based algorithm outperforms the GI algorithm.
For example, under the WNAD strategy, the tree-model
based algorithm achieves 35%, 17%, 16% and 32% lower
WCA than the GI algorithm for the respective applications. Furthermore,
WNAD+Tree yields the lowest WCA for each application
among all strategies.
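Fig. 8: ANL Using Different Strategies. [Bar chart: normalized average network latency (ANL, axis 0% to 100%) of the GI and tree-model based algorithms under the BS, BSh and WNAD strategies.]
Fig. 9: EC Using Different Strategies. [Bar chart: normalized energy consumption (EC, axis 0% to 100%) of the GI and tree-model based algorithms under the BS, BSh and WNAD strategies.]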
The normalized simulation results of the ANL and the EC
are shown in Figures 8 and 9. As anticipated by the
WCA in Figure 7, the tree-model based algorithm achieves
lower ANL and EC than the GI algorithm. The ANL of the tree-model
based algorithm is 12%, 15% and 13% lower than that of
the GI algorithm under the BS, BSh and WNAD strategies respectively. The
same holds for the EC. Furthermore, the WNAD
strategy outperforms BS and BSh and achieves the lowest
ANL and EC (about 5% lower on average). For this set of
applications, the difference between BS and BSh is negligible
with respect to the ANL and EC. The lowest ANL and EC
are achieved by WNAD+Tree, which is 18% lower than
the worst case under the BSh+GI strategy.
V. CONCLUSION
An innovative method for mapping multiple applications on
future many-core NoCs is proposed. The two-step mapping
method first finds a region on the NoC for a given application
and then maps all tasks of the application into the region.
Several strategies based on the MER technique, e.g. BS,
BSh and WNAD, are introduced for the application mapping.
By using these strategies, the algorithm can efficiently find
the optimal objective MER to map the target application.
Following the application mapping, a tree-model based algorithm is proposed for the task mapping and compared against
an existing GI algorithm. The experiment shows that in the
common case, the MER with minimal aspect ratio is ideal for
mapping a given application. Among the proposed strategies
for application mapping, WNAD is likely to obtain a better
solution than BS and BSh. For the task mapping, the proposed
tree-model based algorithm outperforms the GI algorithm in
achieving lower network latency and energy consumption.
The WNAD+Tree strategy achieves the lowest network latency and
energy consumption among all strategies.
VI. ACKNOWLEDGEMENT
The authors would like to thank the Academy of Finland
for the financial support for this work.
REFERENCES
[1] Krste Asanovic, Ras Bodik, Bryan C. Catanzaro, Joseph J. Gebis, Parry
Husbands, Kurt Keutzer, David A. Patterson, William L. Plishker, John
Shalf, Samuel W. Williams, and Katherine A. Yelick. The landscape of
parallel computing research: a view from Berkeley. Technical Report UCB/EECS-2006-183, December 2006.
[2] K. Bazargan, R. Kastner, and M. Sarrafzadeh. Fast template placement
for reconfigurable computing systems. Design Test of Computers, IEEE,
17(1):68 –83, jan-mar 2000.
[3] Bradford M. Beckmann and David A. Wood. Managing wire delay in
large chip-multiprocessor caches. In Proceedings of the 37th annual
IEEE/ACM International Symposium on Microarchitecture, pages 319–
330, December 2004.
[4] L. Benini and G. De Micheli. Networks on chips: a new soc paradigm.
Computer, 35(1):70–78, Jan 2002.
[5] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The
parsec benchmark suite: characterization and architectural implications.
In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pages 72–81, October 2008.
[6] M. Denneau and H. S Warren, Jr. 64-bit cyclops: Principles of operation.
IBMTech-report, 2005.
[7] P.P. Gelsinger. Microprocessors for the new millennium: Challenges,
opportunities, and new frontiers. In Proceedings of The International
Solid State Circuits Conference (ISSCC), pages 22–25, 2001.
[8] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative
Approach, 4th Edition. Morgan Kaufmann, 2007.
[9] Jingcao Hu and Radu Marculescu. Energy- and performance-aware mapping
for regular noc architecture. IEEE Transactions On Computer-Aided
Design of Integrated Circuits and Systems, Vol.24, No.4:551–562, 2005.
[10] Tang Lei and Shashi Kumar. A two-step genetic algorithm for mapping
task graphs to a network on chip architecture. In DSD ’03: Proceedings
of the Euromicro Symposium on Digital Systems Design, page 180,
Washington, DC, USA, 2003. IEEE Computer Society.
[11] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg,
J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full
system simulation platform. Computer, 35(2):50–58, February 2002.
[12] C.A.M. Marcon, E.I. Moreno, N.L.V. Calazans, and F.G. Moraes.
Evaluation of algorithms for low energy mapping onto nocs. In Proc.
IEEE International Symposium on Circuits and Systems ISCAS 2007,
pages 389–392, 2007.
[13] Srinivasan Murali, Martijn Coenen, Andrei Radulescu, Kees Goossens,
and Giovanni De Micheli. Mapping and configuration methods for multi-use-case
networks on chips. In ASP-DAC '06: Proceedings of the 2006
Asia and South Pacific Design Automation Conference, pages 146–151,
Piscataway, NJ, USA, 2006. IEEE Press.
[14] University of Catania. Noxim. http://www.noxim.org/.
[15] Jaswinder Pal Singh, Anoop Gupta, Moriyoshi Ohara, Evan Torrie, and
Steven Cameron Woo. The splash-2 programs: Characterization and
methodological considerations. Computer Architecture, International
Symposium on, 0:24, 1995.
[16] TPC. Tpc-h. http://www.tpc.org/tpch/.
[17] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz,
D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts,
Y. Hoskote, N. Borkar, and S. Borkar. An 80-tile sub-100-w teraflops
processor in 65-nm cmos. Solid-State Circuits, IEEE Journal of,
43(1):29–41, 2008.
[18] Bo Yang, Thomas Canhao Xu, Tero Säntti, and Juha Plosila. Tree-model
based mapping for energy-efficient and low-latency network-on-chip. In
Design and Diagnostics of Electronic Circuits and Systems (DDECS),
pages 189–192, 14-16 2010.