Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Communication-aware Design Space Exploration for Efficient Run-time MPSoC Management

...Read more
Communication-aware Design Space Exploration for Efficient Run-time MPSoC Management 1 Amit Kumar Singh, 2 Akash Kumar, 3 Wu Jigang and 1 Thambipillai Srikanthan 1 School of Computer Engineering, Nanyang Technological University, Singapore 2 Department of Electrical and Computer Engineering, National University of Singapore, Singapore 3 School of Computer Science and Software, Tianjin Polytechnic University, Tianjin, China Email: 1 {amit0011, astsrikan}@ntu.edu.sg, 2 akash@nus.edu.sg, 3 asjgwu@gmail.com Abstract—Real-time multi-media applications are increasingly mapped on modern embedded systems based on Multiprocessor Systems-on-Chip (MPSoCs). Tasks of the applications need to be mapped on the MPSoC resources efficiently in order to satisfy their performance constraints. Exploring all the mappings, i.e. tasks to resources combinations exhaustively may take days or weeks. Additionally, the exploration is performed at design-time that cannot handle dynamism in applications and resources’ status. A run-time mapping technique can cater for the dy- namism but cannot guarantee for strict timing deadlines due to large computations involved at run-time. Thus, an approach performing feasible compute intensive exploration at design-time and using the explored results at run-time is required. This paper presents a solution in the same direction. Communication- aware design space exploration techniques have been proposed to explore different mapping options to be selected at run-time subject to desired performance and available MPSoC resources. Experiments show that the proposed techniques for exploration are faster over an exhaustive exploration and provides almost the same quality of results. I. I NTRODUCTION Advanced multimedia embedded systems (e.g., smart phones, tablets, PDAs) need to support multiple applications concurrently. The increasing performance demands of concur- rently running applications are satisfied by relying the systems on multiprocessor systems-on-ship (MPSoCs), for example, IBM Cell [1] and NXP Nexperia [2]. The MPSoCs may con- tain different type of processing elements (PEs) connected by a communication network in order to achieve high performance by exploiting their distinct features. The system users expect that throughput constraints of all applications running in the system are satisfied which heavily depends upon how efficiently application tasks are mapped onto system resources (PEs). There is an enormous number of possibilities for mapping the tasks onto the PEs. The mapping is accomplished either by design-time DSE [3][4] or run-time mapping strategies [5][6]. The design-time DSE strategies are incapable of handling dynamism such as adding a new appli- cation into the platform at run-time. On the other hand, the run-time mapping strategies cannot provide timing guarantees due to lack of any previous analysis and limited computational resources at run-time. Thus, an approach performing compute intensive analysis (DSE) at design-time and using the analysis results at run-time is required. The design-time DSE strategies need to find a number of mappings by taking an application and a platform as input. An exhaustive DSE to find all the possible mappings is not scalable when the number of tasks/PEs is large as we need to explore for lot of tasks to PEs combinations. Existing DSE strategies are applicable only to fixed MPSoC platform, don’t scale well with the number of PEs in the platform and don’t always provide the largest throughput mapping as they perform DSE in view of optimizing for the performance metrics such as energy and resource optimization. Contribution: This paper presents design-time DSE strate- gies that perform analysis on a generic MPSoC platform in view of optimizing throughput and produce resource- throughput trade-off points, i.e. tasks to PEs mappings with their throughput. A resource has been referred as a tile that essentially contains a processing engine along with other elements such as memory. The platform contains different type of tiles such as a processor or a reconfigurable hardware (RH) block as the processing engine, i.e. the platform is heterogeneous. The generated points can be used by a light- weight run-time manager to select the best point depending upon the available tiles in the platform and desired throughput. First, an exhaustive DSE strategy is presented that produces all the possible tasks to PEs mappings, which is not scalable with the number of tasks in the application. To overcome the large exploration overhead (may be a couple of weeks for large application size), we present a communication-aware DSE (CADSE) strategy that discards the evaluation of inefficient points and produces almost the same best trade-off points. To further accelerate the DSE, we incorporated pruning in the CADSE (PCADSE) where evaluation of the number of trade-off points is further reduced based on a pruning idea. The quality of the best mappings generated by the CADSE and PCADSE strategies do not differ significantly, while the exploration process is speeded up. Overview. Section II introduces state-of-the-art multi- processor DSE strategies. Our proposed DSE methodologies along with multiprocessor/application model used in this work are introduced in Section III. Section IV presents a set of ex- perimental results on the efficiency of the proposed approach. Section V concludes the paper and provides directions for future work. II. RELATED WORK Several DSE strategies providing single mapping for an application have been reported in literature [7][8][9]. These strategies are applicable only to fixed MPSoC platforms and mappings are not optimized from throughput point of view as throughput optimization is not their target but to satisfy some constraint. Further, they cannot handle dynamism in resource availability and throughput (QoS) requirement at run-time. However, our DSE strategies are applied to a generic MPSoC platform and generate a number of mappings with different resource requirement and throughput, which helps to handle 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming 978-0-7695-4575-2/11 $26.00 © 2011 IEEE DOI 10.1109/PAAP.2011.18 65 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming 978-0-7695-4575-2/11 $26.00 © 2011 IEEE DOI 10.1109/PAAP.2011.18 72 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming 978-0-7695-4575-2/11 $26.00 © 2011 IEEE DOI 10.1109/PAAP.2011.18 72 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming 978-0-7695-4575-2/11 $26.00 © 2011 IEEE DOI 10.1109/PAAP.2011.18 72
        Fig. 1. Multiprocessor platform example. run-time dynamism and allows them to be mapped on any architecture without requiring analysis to be repeated. DSE strategies providing multiple mappings for homoge- neous MPSoC platforms have been recently presented in [4], [10] and [11]. In [4], DSE is performed in view of optimizing for the resource usage, whereas in [10] and [11], for optimizing power. These strategies have several drawbacks, e.g., applica- ble only to fixed homogeneous platforms, generated mappings are not optimized from throughput point of view, generate duplicate mappings for larger platforms and not scalable with the platform size. The duplicate mappings have the same throughput but they differ in placement of tasks on different tiles with the same tasks to tiles binding. Singh et al. [12] propose a DSE strategy that performs exploration in view of optimizing throughput by considering a generic platform but the considered platform is homogeneous, i.e., contains only one type of tile, and the exploration follows a pruning strategy. Our strategy considers a generic platform containing dif- ferent types of tiles depending upon the application tasks’ specifications and always provides mappings with maximum throughput by performing exploration in a communication- aware manner. The generated mappings are applicable to any platform containing tiles with maximum separation between two of them as the one considered during DSE without repeating the DSE. The generation of duplicate mappings are avoided by not considering a bigger platform than required that can exploit all the parallelism present in the application. III. PROPOSED DESIGN SPACE EXPLORATION METHODOLOGIES This section introduces our proposed DSE methodologies. First, we describe the hardware MPSoC architecture and application model used in our DSE methodologies. Multiprocessor Architecture Model. The hardware archi- tecture model describes platform processing units and the in- terconnection network between them. The platform model uses tile-based architecture that uses an interconnection network to connect the tiles as shown in the example platform of Fig. 1. The platform contains tiles t 1 , t 2 , t 3 & t 4 , which are connected by point-to-point connections (c) with fixed latencies. Latency of connections through any Network-on-Chip (NoC) can be modeled so long as the latencies between tiles are provided. Each tile contains a processing engine (processor P or RH), a local memory (M, size in bits), a set of communication buffers called network interface (NI) that are accessed both by the interconnect and the local processor, and maximum number of input/output connections to connect with the NI that provide maximum incoming/outgoing bandwidth (in bits/time- unit). Multiprocessor systems such as StepNP [13] and Eclipse [14] fit nicely into this platform model.                        Fig. 2. SDF Graph model of an H.263 decoder. The communication network used in the example platform of Fig. 1 is arranged in a 2-D mesh topology. The distance between two tiles is referred to as hop distance. Adjacent tiles t 1 & t 2 are at hop distance of 1 and t 1 & t 4 at hop distance of 2 (1 hop in X-direction to reach t 2 and 1-hop in Y-direction to reach t 4 ). The latency of connections between the tiles is directly proportional to hop distance. We increase the latency of connections between the tiles to account for the higher hop distances. This facilitates for finding mappings even when the tiles are further apart in the actual platform. Application Model. The application model considers throughput-constrained multimedia applications consisting of multiple tasks. Synchronous Dataflow Graphs (SDFGs) [15] are used to model such applications. Throughput is an im- portant constraint and determines how often tasks of the application finish their execution, which is determined by the cycles in the SDFG. An SDFG model of H.263 decoder application is shown in Fig. 2. Nodes modeling tasks are called actors that communicate with tokens sent from one actor to another through edges modeling dependencies. The application is modeled with four actors vld, iq, idct & mc and four edges d1, d2, d3 & d4. An actor has following attributes: its implementation alternatives (e.g., processor and RH tile), execution time (in time-units) and memory needed (in bits) on the implementation alternatives. An edge has following attributes: size of a token (in bits), memory (in tokens) needed when connected actors are allocated to the same tile, memory (in tokens) needed in source and destination tiles, bandwidth (in bits/time-unit) needed when connected actors are allocated to different tiles. An actor fires (executes) when there are sufficient tokens on all of its input edges and sufficient buffer space on all of its output channels. At each firing, a fixed number of tokens from the input edges are consumed and a fixed number of tokens on the output edges are produced. These numbers are referred to rates that define how often actors have to fire with respect to each other. The edges may have initial tokens to start the actor firing, indicated by a bullet in the Fig. 2. The application model also specifies a throughput-constraint. Existing DSE strategies find mappings while performing optimization for power and resource usage. This might lead to mapping of parallel executing actors on the same tile and thus forcing their execution sequentially, resulting in reduced throughput. However, our strategies optimize throughput so produce mappings with maximum throughput. Our strategies map the connected and sequentially executing actors on the same tile, resulting in reduced communication overhead that might maximize throughput. A number of mappings are eval- uated for each multimedia application to be supported on a hardware platform. The evaluation considers finding different 66 73 73 73
2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming Communication-aware Design Space Exploration for Efficient Run-time MPSoC Management 1 Amit Kumar Singh, 2 Akash Kumar, 3 Wu Jigang and 1 Thambipillai Srikanthan School of Computer Engineering, Nanyang Technological University, Singapore Department of Electrical and Computer Engineering, National University of Singapore, Singapore 3 School of Computer Science and Software, Tianjin Polytechnic University, Tianjin, China Email: 1 {amit0011, astsrikan}@ntu.edu.sg, 2 akash@nus.edu.sg, 3 asjgwu@gmail.com 1 2 scale well with the number of PEs in the platform and don’t always provide the largest throughput mapping as they perform DSE in view of optimizing for the performance metrics such as energy and resource optimization. Contribution: This paper presents design-time DSE strategies that perform analysis on a generic MPSoC platform in view of optimizing throughput and produce resourcethroughput trade-off points, i.e. tasks to PEs mappings with their throughput. A resource has been referred as a tile that essentially contains a processing engine along with other elements such as memory. The platform contains different type of tiles such as a processor or a reconfigurable hardware (RH) block as the processing engine, i.e. the platform is heterogeneous. The generated points can be used by a lightweight run-time manager to select the best point depending upon the available tiles in the platform and desired throughput. First, an exhaustive DSE strategy is presented that produces all the possible tasks to PEs mappings, which is not scalable with the number of tasks in the application. To overcome the large exploration overhead (may be a couple of weeks for large application size), we present a communication-aware DSE (CADSE) strategy that discards the evaluation of inefficient points and produces almost the same best trade-off points. To further accelerate the DSE, we incorporated pruning in the CADSE (PCADSE) where evaluation of the number of trade-off points is further reduced based on a pruning idea. The quality of the best mappings generated by the CADSE and PCADSE strategies do not differ significantly, while the exploration process is speeded up. Overview. Section II introduces state-of-the-art multiprocessor DSE strategies. Our proposed DSE methodologies along with multiprocessor/application model used in this work are introduced in Section III. Section IV presents a set of experimental results on the efficiency of the proposed approach. Section V concludes the paper and provides directions for future work. Abstract—Real-time multi-media applications are increasingly mapped on modern embedded systems based on Multiprocessor Systems-on-Chip (MPSoCs). Tasks of the applications need to be mapped on the MPSoC resources efficiently in order to satisfy their performance constraints. Exploring all the mappings, i.e. tasks to resources combinations exhaustively may take days or weeks. Additionally, the exploration is performed at design-time that cannot handle dynamism in applications and resources’ status. A run-time mapping technique can cater for the dynamism but cannot guarantee for strict timing deadlines due to large computations involved at run-time. Thus, an approach performing feasible compute intensive exploration at design-time and using the explored results at run-time is required. This paper presents a solution in the same direction. Communicationaware design space exploration techniques have been proposed to explore different mapping options to be selected at run-time subject to desired performance and available MPSoC resources. Experiments show that the proposed techniques for exploration are faster over an exhaustive exploration and provides almost the same quality of results. I. I NTRODUCTION Advanced multimedia embedded systems (e.g., smart phones, tablets, PDAs) need to support multiple applications concurrently. The increasing performance demands of concurrently running applications are satisfied by relying the systems on multiprocessor systems-on-ship (MPSoCs), for example, IBM Cell [1] and NXP Nexperia [2]. The MPSoCs may contain different type of processing elements (PEs) connected by a communication network in order to achieve high performance by exploiting their distinct features. The system users expect that throughput constraints of all applications running in the system are satisfied which heavily depends upon how efficiently application tasks are mapped onto system resources (PEs). There is an enormous number of possibilities for mapping the tasks onto the PEs. The mapping is accomplished either by design-time DSE [3][4] or run-time mapping strategies [5][6]. The design-time DSE strategies are incapable of handling dynamism such as adding a new application into the platform at run-time. On the other hand, the run-time mapping strategies cannot provide timing guarantees due to lack of any previous analysis and limited computational resources at run-time. Thus, an approach performing compute intensive analysis (DSE) at design-time and using the analysis results at run-time is required. The design-time DSE strategies need to find a number of mappings by taking an application and a platform as input. An exhaustive DSE to find all the possible mappings is not scalable when the number of tasks/PEs is large as we need to explore for lot of tasks to PEs combinations. Existing DSE strategies are applicable only to fixed MPSoC platform, don’t 978-0-7695-4575-2/11 $26.00 © 2011 IEEE DOI 10.1109/PAAP.2011.18 II. R ELATED W ORK Several DSE strategies providing single mapping for an application have been reported in literature [7][8][9]. These strategies are applicable only to fixed MPSoC platforms and mappings are not optimized from throughput point of view as throughput optimization is not their target but to satisfy some constraint. Further, they cannot handle dynamism in resource availability and throughput (QoS) requirement at run-time. However, our DSE strategies are applied to a generic MPSoC platform and generate a number of mappings with different resource requirement and throughput, which helps to handle 65 72                 Fig. 1.   Multiprocessor platform example. Fig. 2. run-time dynamism and allows them to be mapped on any architecture without requiring analysis to be repeated. DSE strategies providing multiple mappings for homogeneous MPSoC platforms have been recently presented in [4], [10] and [11]. In [4], DSE is performed in view of optimizing for the resource usage, whereas in [10] and [11], for optimizing power. These strategies have several drawbacks, e.g., applicable only to fixed homogeneous platforms, generated mappings are not optimized from throughput point of view, generate duplicate mappings for larger platforms and not scalable with the platform size. The duplicate mappings have the same throughput but they differ in placement of tasks on different tiles with the same tasks to tiles binding. Singh et al. [12] propose a DSE strategy that performs exploration in view of optimizing throughput by considering a generic platform but the considered platform is homogeneous, i.e., contains only one type of tile, and the exploration follows a pruning strategy. Our strategy considers a generic platform containing different types of tiles depending upon the application tasks’ specifications and always provides mappings with maximum throughput by performing exploration in a communicationaware manner. The generated mappings are applicable to any platform containing tiles with maximum separation between two of them as the one considered during DSE without repeating the DSE. The generation of duplicate mappings are avoided by not considering a bigger platform than required that can exploit all the parallelism present in the application.                                                SDF Graph model of an H.263 decoder. The communication network used in the example platform of Fig. 1 is arranged in a 2-D mesh topology. The distance between two tiles is referred to as hop distance. Adjacent tiles t1 & t2 are at hop distance of 1 and t1 & t4 at hop distance of 2 (1 hop in X-direction to reach t2 and 1-hop in Y-direction to reach t4 ). The latency of connections between the tiles is directly proportional to hop distance. We increase the latency of connections between the tiles to account for the higher hop distances. This facilitates for finding mappings even when the tiles are further apart in the actual platform. Application Model. The application model considers throughput-constrained multimedia applications consisting of multiple tasks. Synchronous Dataflow Graphs (SDFGs) [15] are used to model such applications. Throughput is an important constraint and determines how often tasks of the application finish their execution, which is determined by the cycles in the SDFG. An SDFG model of H.263 decoder application is shown in Fig. 2. Nodes modeling tasks are called actors that communicate with tokens sent from one actor to another through edges modeling dependencies. The application is modeled with four actors vld, iq, idct & mc and four edges d1, d2, d3 & d4. An actor has following attributes: its implementation alternatives (e.g., processor and RH tile), execution time (in time-units) and memory needed (in bits) on the implementation alternatives. An edge has following attributes: size of a token (in bits), memory (in tokens) needed when connected actors are allocated to the same tile, memory (in tokens) needed in source and destination tiles, bandwidth (in bits/time-unit) needed when connected actors are allocated to different tiles. An actor fires (executes) when there are sufficient tokens on all of its input edges and sufficient buffer space on all of its output channels. At each firing, a fixed number of tokens from the input edges are consumed and a fixed number of tokens on the output edges are produced. These numbers are referred to rates that define how often actors have to fire with respect to each other. The edges may have initial tokens to start the actor firing, indicated by a bullet in the Fig. 2. The application model also specifies a throughput-constraint. Existing DSE strategies find mappings while performing optimization for power and resource usage. This might lead to mapping of parallel executing actors on the same tile and thus forcing their execution sequentially, resulting in reduced throughput. However, our strategies optimize throughput so produce mappings with maximum throughput. Our strategies map the connected and sequentially executing actors on the same tile, resulting in reduced communication overhead that might maximize throughput. A number of mappings are evaluated for each multimedia application to be supported on a hardware platform. The evaluation considers finding different III. P ROPOSED D ESIGN S PACE E XPLORATION M ETHODOLOGIES This section introduces our proposed DSE methodologies. First, we describe the hardware MPSoC architecture and application model used in our DSE methodologies. Multiprocessor Architecture Model. The hardware architecture model describes platform processing units and the interconnection network between them. The platform model uses tile-based architecture that uses an interconnection network to connect the tiles as shown in the example platform of Fig. 1. The platform contains tiles t1 , t2 , t3 & t4 , which are connected by point-to-point connections (c) with fixed latencies. Latency of connections through any Network-on-Chip (NoC) can be modeled so long as the latencies between tiles are provided. Each tile contains a processing engine (processor P or RH), a local memory (M, size in bits), a set of communication buffers called network interface (NI) that are accessed both by the interconnect and the local processor, and maximum number of input/output connections to connect with the NI that provide maximum incoming/outgoing bandwidth (in bits/timeunit). Multiprocessor systems such as StepNP [13] and Eclipse [14] fit nicely into this platform model. 73 66 & & &, '()(   till hop distance reaches to max hop distance (input to the DSE flow). The designers can opt for a suitable value of max hop distance depending upon the expected hardware platform at run-time, where, maximum hop distance between two tiles can be up to max hop distance. Varying hop distance consideration provides mappings where application edges are mapped to connections at hop distance of one (to cater for minimum latency) to max hop distance (to cater for maximum latency). This caters for the run-time aspects when the available tiles are at different hop distances. 2) Evaluating Processor Tiles Mappings: This step evaluates all the possible actors to processor (Proc) tiles mappings. The evaluation follows a set of steps described subsequently. An application with one actor (a1 ) to be mapped on Proc tiles has only one unique actor to tile mapping, which is computed from equation 1. An application with two actors (a1 ,a2 ) has two unique mappings that is computed from equation 2. One mapping contains actors on separate tiles (1 C0 implies that from the remaining one actor a1 , it is not chosen to combine it with actor a2 ) and another on the same tile (1 C1 implies that actor a1 is chosen to combine it with actor a2 ). Similarly, for an application with three actors (a1 ,a2 ,a3 ), the unique mappings are computed from equation 3. First, actor a3 is mapped separately, i.e., not combined with others (from the remaining two actors a1 & a2 , none is chosen to combine with a3 , indicated as 2 C0 ) and remaining two actors are mapped by using equation 2 (fEDSE (2,a1 ,a2 )), providing two unique mappings. Then, from the remaining two actors one actor is chosen to combine with actor a3 (2 C1 ) and the remaining actor is mapped separately, providing two unique mappings. Next, from the remaining two actors, both are chosen to combine with actor a3 , providing one unique mapping. Thus, for an application with three actors, a total of five unique actors to tiles mappings are evaluated. The equations are extended in the similar manner for larger number of actors. For n actors (a1 ,a2 ,...,an ), the mappings can be computed from equation 4. It can be observed that when computing mappings for larger number of actors, the mappings computed at lower number of actors are used, such as fEDSE (n − 1,a1 ,a2 ,...,an−1 ) in fEDSE (n,a1 ,a2 ,...,an ). 0)& 5# %#  6&738  .' 6&8595 6&,8  .    .     ! "  ## "$      %  #    *    + +"(    *   "    %  :        )  )  ,    .'()(  /   %   )%  - , ,/ - 0) Fig. 3. 1$ 2)) 1234 Exhaustive Design Space Exploration Flow. mappings and their throughput. For each mapping, actors are bound to tiles and edges to memory inside tiles or to connections in the platform. The binding is considered valid if memory imposed, allocated input/output connections and allocated incoming/outgoing bandwidth are less than or equal to the maximum available on each tile. Only the valid bindings are considered and throughput for the same is computed. For computing throughput, first, a static-order schedule for each tile is constructed, which orders the execution of bound actors. Then, all the binding and scheduling decisions are modeled in a graph called binding-aware SDFG. Finally, throughput is computed by self-timed state-space exploration of the bindingaware SDFG [16]. We now introduce our DSE strategies. A. Exhaustive Design Space Exploration The exhaustive design space exploration (EDSE) flow evaluates all the possible actors to tiles combinations, i.e., mappings. The flow is presented in Fig. 3. The flow takes application models as input and stores the best mapping (MTDB) at each possible resource combination. The applications are evaluated one after another by incrementing the application number (appNumber++). The main steps of the flow are highlighted and described subsequently. 1) Considering a Suitable Platform Graph: This step of the analysis flow considers a platform graph that can evaluate all possible mappings for each application. The application actors implementation alternatives could be a number of tile types. In particular, we have considered processor (Proc) and reconfigurable hardware (RH) tiles. The considered platform contains max nA Proc & max nA RH tiles, where max nA is the maximum value of number of actors in an application amongst all the applications. This platform is capable of exploiting all the parallelism present in each of the application and considering any bigger platform wouldn’t provide better performance. The initial considered platform contains tiles separated by a distance of one hop distance (hop distance = 1), which caters for a minimum latency for all the connections between the tiles. The DSE flow is repeated by considering a similar platform containing tiles separated by one higher hop distance (hop distance++), i.e., with increased latency for connections, fEDSE (1, a1 ) = 1 (1) fEDSE (2, a1 , a2 ) =1 C0 × f (1, a1 ) +1 C1 (2) fEDSE (3, a1 , a2 , a3 ) =2 C0 × f (2, a1 , a2 ) +2 C1 × f (1, remain actor) +2 C2 (3) . . . fEDSE (n, a1 , a2 , ..., an ) =(n−1) C0 × f (n − 1, a1 , a2 , ..., an−1 ) +(n−1) C1 × f (n − 2, remain actors) +(n−1) C2 × f (n − 3, remain actors) . . +(n−1) Cn−2 × f (1, remain actor) +(n−1) Cn−1 (4) 74 67 Algorithm 2: Distinct Channels Calculation Algorithm 1: Proc/RH tiles Comb. Mappings Evaluation Input: Application Graph Output: Number of distinct channels distinctChannelCount = 0; for actor ai = FirstActor (a1 ) to LastActor (an ) do for actor aj = actor next to ai (i.e. ai +1) to LastActor (an ) do if a channel exists between ai and aj then distinctChannelCount++; end end end Input: Proc tiles mappings M Output: Proc and RH tiles combination mappings to be added to set M for each Proc tiles mapping α (∈ M) do findProc&RHTilesCombMappings(α, t1 ); end function findProc&RHTilesCombMappings(Mapping β, Tile startProcTile) if startProcTile == lastProcTile+1 then return; end for Tile i = firstProcTile to lastProcTile (in current mapping) do if tile i contains actor(s) and all of them can be mapped on RH tile then Move actor(s) of tile i on a free RH tile to generate a new mapping α; Compute throughput of α; Add α with its throughput to set M ; findProc&RHTilesCombMappings(α, i+1); end end fCADSE (n, a1 , a2 , ..., an ) =(n−1) C0−conn × f (n − 1, a1 , a2 , ..., an−1 ) +(n−1) C1−conn × f (n − 2, remain actors) +(n−1) C2−conn × f (n − 3, remain actors) . . +(n−1) C(n−2)−conn × f (1, remain actor) +(n−1) C(n−1)−conn (5) To find whether a set of actors are connected, we find number of distinct channels between them. If there are more than one channel between two actors then only one channel is counted as the distinct channel. The actors are connected if the number of distinct channels between the actors is ≥ (the number of actors – 1). For n actors (a1 ,a2 ,...,an ), the total number of distinct channels are calculated from Algorithm 2. end function 3) Evaluating Processor and Reconfigurable Hardware Tiles Combination Mappings: The Proc and RH tiles combination mappings are evaluated by Algorithm 1, which takes Proc tiles mappings (M) obtained in the previous step (Fig. 3) as input. The algorithm evaluates a number of mappings and adds them in the same set. Thus, the final mapping set M contains all the evaluated mappings using different Proc/RH tiles combinations. For the example H.263 decoder application when each actor has two implementation alternatives, we get a total of 94 mappings at each hop distance value. 4) Selecting Best Mapping at Each Resource Combination: At each Proc/RH tiles combination, we get a number of mappings. For example, at 2Proc & 2RH tiles resource combination, we get a total of 6 mappings for the H.263 decoder. This step selects the maximum throughput mapping at each resource combination and stores it into the mappings & throughput database (MTDB) (Fig. 3). These stored mappings provide options for mapping the application at run-time. The best mapping can be selected based on the available platform resources and desired throughput. The selected mapping is then used to configure the platform. C. Pruning-based Communication-Aware Design Space Exploration The pruning-based communication-aware design space exploration (PCADSE) strategy incorporates pruning in the CADSE strategy. In Fig. 3, after evaluating Proc tiles mapping by CADSE, only the maximum throughput mapping at each Proc tile count is passed to evaluate Proc/RH tiles combination mappings as earlier. Proc tile count for a mapping is defined as the number of used Proc tiles. This strategy assumes that by starting with the best Proc tile mapping we should get the best mappings at Proc/RH tiles combinations. The pruning consideration facilitates for speeded exploration over the CADSE and provides almost the same quality (throughput) mappings. IV. E XPERIMENTS The proposed DSE methodologies have been implemented as an extension to the publicly available tool set SDF3 [17]. To evaluate run-time and quality of the methodologies, 100 random applications modeled as SDFGs with 4, 5, 6 and 7 actors having one of their implementation alternatives as Proc tile and other RH tile if present have been considered. The same generic platform graph is considered to evaluate the different DSE methodologies. We have adopted a tile-based architecture but any type of architecture can be modeled so long latencies between the tiles are known. The experiments have been performed on a Core 2 Duo processor at 3.16 GHz. The number of mappings evaluated by EDSE increases exponentially with the number of actors. Further, the number of mappings increases even more when the implementation alternatives of actors get increased. Thus, the exploration may take a couple of days. This makes the EDSE non-scalable although it always provides the best quality of mapping at each resource combination. The CADSE has been employed B. Communication-Aware Design Space Exploration The communication-aware design space exploration (CADSE) strategy incorporates communication awareness in the EDSE (Fig. 3). The step to explore Proc tiles mappings (Evaluate actors-to-Proc tiles mappings) is modified to perform the exploration in communication-aware manner. For an application with n actors (a1 ,a2 ,...,an ), the Proc tiles mappings are evaluated from equation 5, which requires n-1, n-2, n-3, ..., 2, 1 actors mappings in advance as in the EDSE. These mappings can be calculated by putting different values of n in the equation 5. This equation differs from equation 4 while choosing actors to be combined with actor an on the same tile. The chosen actors and actor an are checked if they are connected (conn) before they are mapped on the same tile. For example, (n−1) C2−conn in equation 5 specifies that the chosen (C) two actors and actor an are connected. 75 68   resources and hop distance between them.       V. C ONCLUSION This paper presents design space exploration (DSE) strategies for supporting efficient run-time MPSoC management. The strategies store the best mapping at each resource combination, which can be directly used at run-time depending upon the available resources and desired throughput. Three DSE strategies have been presented. One strategy performs DSE exhaustively (EDSE) and produces the best quality of mapping at each resource combination, whereas, this strategy has worst run-time. Next, a communication-aware DSE (CADSE) strategy is presented to perform the exploration in communicationaware manner. This reduces the total number of mappings to be evaluated and thus the run-time for exploration. The quality of mappings remains almost the same. To further reduce the exploration run-time, a pruning criteria has been incorporated in the CADSE, which provides a bit of degraded quality of mappings. In future, we plan to develop more ways of faster DSE in order to further speed up the exploration process while providing almost the same quality of mappings as of the EDSE.                               Fig. 4.  !"#$%&%&#''(%$#% Speed up obtained by CADSE and PCADSE over EDSE                                     !#(%)   Fig. 5. & *&' +#(  Quality of mappings by CADSE and PCADSE over EDSE R EFERENCES to speed up the exploration process while providing almost the same quality (throughput) of mappings. The PCADSE speeds up the exploration process further while providing a bit of degraded quality of mappings. Fig. 4 shows the speed up obtained by CADSE and PCADSE over the EDSE for all the 100 applications. The speed up by CADSE and PCADSE is calculated by dividing execution time of EDSE to the execution time of CADSE and PCADSE, respectively. The applications are sorted by the number of actors within them. It can be observed that CADSE is faster over the EDSE, and the PCADSE is faster even over the CADSE for all the applications. It is also clear that as the number of actors increases in the applications, the speed up obtained by the CADSE increases as the strategy discards evaluation of more number of mappings by incorporating communication-aware exploration, whereas speed up obtained by the PCADSE increases further as the strategy has to prune from a larger number of Proc tiles mappings. On an average, the CADSE and PCADSE is faster by 3.7× and 11×, respectively when compared to EDSE. Fig. 5 shows the quality (throughput) of the best mapping at 2 Proc and and 1 RH tiles resource combination for all the applications when tiles are assumed to be separated by a fixed hop distance. The best mapping throughput obtained by CADSE and PCADSE are normalized with respect to (w.r.t.) EDSE. The normalized throughput values are plotted after sorting them in descending order. It can be observed that the CADSE and PCADSE provide the same best mappings as of EDSE for more than 90% and 80% of the applications respectively. Similar behavior is obtained at other resource combinations. Thus, we can say that for most of the applications, we get the same quality of mappings by all the DSE strategies. The DSE methodologies store the best mappings at each resource combination at varying hop distance values (referred as hops). The best mapping at different hops remains the same with a bit of different quality of the mapping as the delays of connections between the tiles get changed. At run-time, the best mapping can be selected depending upon the available [1] M. Kistler et al., “Cell multiprocessor communication network: Built for speed,” IEEE Micro, vol. 26, pp. 10–23, 2006. [2] M. Kim et al., “Energy-aware cosynthesis of real-time multimedia applications on mpsocs using heterogeneous scheduling policies,” ACM Trans. Embed. Comput. Syst., vol. 7, pp. 9:1–9:19, 2008. [3] G. Ascia, V. Catania, A. G. Di Nuovo, M. Palesi, and D. Patti, “Efficient design space exploration for application specific systems-on-a-chip,” J. Syst. Archit., vol. 53, pp. 733–750, 2007. [4] S. Stuijk et al., “A predictable multiprocessor design flow for streaming applications with dynamic behaviour,” in Proceedings of the 13th Euromicro Conference on Digital System Design, 2010, pp. 548–555. [5] V. Nollet et al., “Run-time management of a mpsoc containing fpga fabric tiles,” IEEE Trans. Very Large Scale Integr. Syst., vol. 16, pp. 24–33, 2008. [6] A. K. Singh et al., “Communication-aware heuristics for run-time task mapping on noc-based mpsoc platforms,” J. Syst. Archit., vol. 56, pp. 242–255, 2010. [7] O. Moreira et al., “Scheduling multiple independent hard-real-time jobs on a heterogeneous multiprocessor,” in Proc. EMSOFT’07, 2007, pp. 57–66. [8] A. Bonfietti et al., “Throughput constraint for synchronous data flow graphs,” in Proc. CPAIOR ’09. Springer-Verlag, 2009, pp. 26–40. [9] A. Schranzhofer et al., “Dynamic power-aware mapping of applications onto heterogeneous mpsoc platforms,” Industrial Informatics, IEEE Transactions on, vol. 6, no. 4, pp. 692 –707, 2010. [10] C. Ykman-Couvreur at al., “Linking run-time resource management of embedded multi-core platforms with automated design-time exploration,” Computers Digital Techniques, IET, vol. 5, no. 2, pp. 123 –135, 2011. [11] G. Mariani et al., “An industrial design space exploration framework for supporting run-time resource management on multi-core systems,” in Proceedings of DATE, 2010, pp. 196–201. [12] A. K. Singh et al., “A hybrid strategy for mapping multiple throughputconstrained applications on mpsocs,” in accepted for publication in Proc. of CASES, 2011. [13] P. G. Paulin et al., “Application of a multi-processor soc platform to high-speed packet forwarding,” in Proc. DATE, 2004, pp. 58–63. [14] M. J. Rutten, J. T. J. van Eijndhoven, E. G. T. Jaspers, P. van der Wolf, E.-J. D. Pol, O. P. Gangwal, and A. Timmer, “A heterogeneous multiprocessor architecture for flexible media processing,” IEEE Des. Test, vol. 19, pp. 39–50, 2002. [15] E. A. Lee and D. G. Messerschmitt, “Static scheduling of synchronous data flow programs for digital signal processing,” IEEE Trans. Comput., vol. 36, pp. 24–35, 1987. [16] A. H. Ghamarian et al., “Throughput analysis of synchronous data flow graphs,” in Proc. ACSD, 2006, pp. 25–36. [17] S. Stuijk, M. Geilen, and T. Basten, “SDF3 : SDF For Free,” in Proc. ACSD, 2006, pp. 276–278. 76 69