G-MPSoC: Generic Massively Parallel Architecture on FPGA

jean-luc Dekeyser

Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS G-MPSoC: Generic Massively Parallel Architecture on FPGA HANA KRICHENE University of Lille 1 INRIA Lille Nord Europe Lille - France ENIS School CES Laboratory Sfax - Tunisia hana.krichene@inria.fr MOUNA BAKLOUTI MOHAMED ABID ENIS School CES Laboratory Sfax - Tunisia mouna.baklouti@enis.rnu.tn mohamed.abid@enis.rnu.tn PHILIPPE MARQUET JEAN-LUC DEKEYSER University of Lille 1 INRIA Lille Nord Europe Lille - France Philippe.Marquet@univ-lille1.fr jean-luc.dekeyser@univ-lille1.fr Abstract: Nowadays, recent intensive signal processing applications are evolving and are characterized by the diversity of algorithms (filtering, correlation, etc.) and their numerous parameters. Having a flexible and programmable system that adapts to changing and various characteristics of these applications reduces the design cost. In this context, we propose in this paper Generic Massively Parallel architecture (G-MPSoC). G-MPSoC is a System-on-Chip based on a grid of clusters of Hardware and Software Computation Elements with different size, performance, and complexity. It is composed of parametric IP-reused modules: processor, controller, accelerator, memory, interconnection network, etc. to build different architecture configurations. The generic structure of GMPSoC facilitates its adaptation to the intensive signal processing applications requirements. This paper presents G-MPSoC architecture and details its different components. The FPGA-based implementation and the experimental results validate the architectural model choice and show the effectiveness of this design. Key–Words: SoC, FPGA, MPP, Generic architecture, parallelism, IP-reused 1 Introduction cuit layout modifications. In this work, Xilinx Virtex6 ML605 board is used to implement the G-MPSoC architecture and to evaluate the experimental results. The remainder of the paper is structured as follows: Section 2 discusses some related works; section 3 describes the proposed architecture and its execution model; the implementation methodology on FPGA board is detailed in Section 4; then, the experimental results are discussed in Section 5; and finally, Section 6 concludes the paper and proposes some perspectives. The intensive signal processing applications are increasingly oriented to specialized hardware accelerators, which allow rapid treatment for specific tasks. However, these applications are characterized by repetitive tasks performing multiple data, which require massive parallelism for their efficient execution. To achieve high performance required by these applications, many massively parallel Systemon-Chips are proposed [1, 7, 5, 2]. Despite their effectiveness, these solutions are still dedicated to specific applications and it is generally difficult to make future changes to adapt new applications. To address this problem, this paper proposes a novel Generic Massively Parallel System-on-Chip, named G-MPSoC. This system, based on modular structure, allows building different configurations to cover a wide range of intensive signal processing applications. The architectural model of G-MPSoC can integrate software homogeneous computation units or hardware(accelerator)/software(processor) heterogeneous computation units. This generic feature allows the best partition of the specific tasks on the computational resources. To implement this system, the FPGA platform is targeted to exploit its reconfigurable structure, which facilitates the test of different G-MPSoC configurations with rapid re-design or cir- E-ISSN: 2224-266X 2 Related work Nowadays, the digital embedded systems migrate toward the massively parallel on-Chip design, due to its provided high performance. Among these systems, we note the General-Purpose Processing on Graphics Processing Units (GP-GPU) [2, 24], which is a massively parallel System-on-Chip based on a hybrid model between vector processing and hardware threading. It allows high performance by hiding the memory latency [1]. But, it loses efficiency with the large branch divergence between executing threads [3] when data dependency is demanded in intensive signal (image, sound, motion...) processing applications. In this gap, Platform 2012 (P2012) [1] is positioned 456 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS 3 with highly coupled building blocks based on cluster of extensible processors varying from 1 to 16 and shared the same memory. Despite its high scalability, this architecture still specialized for limited range of applications such as multi-modal sensor fusion and image understanding. To provide more flexibility, a heterogeneous extension to the P2012 platform is proposed with He-P2012 [6] architecture. In He-P2012, the clusters of PEs used in P2012 are tightly coupled with hardware accelerators. All of them share the same data memory. With this architecture, a programming model is proposed, allowing the dispatch of Hardware and Software tasks with the same delay. Enlarge the initial software platform by additional hardware blocks can increase the design complexity with multiple communication interfaces. Others architectures have adopted the same strategy of massively parallel shared memory architecture, such as STM STHORM [1], Kalray MPPA [7], Plurality HAL [8], the NVIDIA Fermi GPU [9], etc. Although their provided high performance, sharing the same memory unit can limit the system bandwidth and cause some data access congestion, which limit the system scalability. The autonomous control structure for massively parallel System-on-Chip is proposed with the MPPA [4] architecture and Event-Driven Massively Parallel Fine-Grained Processor Array [25], executing in MIMD fashion. They provide more flexibility with asynchronous execution than the previous centralized and synchronous architectures. But, the use of a completely decentralized processing structure makes the control task of data transfer between independent computation units difficult to achieve. Generic Massively Parallel Architecture based on Synchronous Communication Asynchronous Computation 3.1 G-MPSoC architecture overview The G-MPSoC architecture is composed of a Master Controller Unit (MCU) connected to its sequential instructions memory, called MCU-memory, and a grid of Slave Control Units (SCUs). Each SCU is connected to a cluster of 16 Computation Elements (CEs), known collectively as Node. The CE can be a Software Processing Element (PE) or a Hardware specialized Accelerator-IP. Each CE is connected to its local instructions and data memories, called Mi memory. The parameter i is relative to the CE number. The MCU and SCUs grid are connected through a bus with single hierarchical level and the SCUs are connected together through neighbourhood interconnection network. Fig. 1 shows the hardware implementation of the G-MPSoC. Bellow, we present the design of the G-MPSoC components. M0 M0 CE0 M15 M15 SCU (0,2) M0 M0 CE15 CE15 M0 CE0 M1 M1 CE1 M15 CE1 M15 SCU (2,1) CE15 M15 SCU (2,2) CE15 Nodes grid Figure 1: G-MPSoC architecture 3.1.1 MCU The MCU is the first execution and control level in G-MPSoC. It is a simple processor, which fetches and decodes program instructions, executes sequential instructions and broadcasts mask activity and parallel control instructions to SCUs. In a parallel execution, MCU remains in the idle state and controls the end signal to resume the main program execution. In GMPSoC, the MCU is based on modified FC16 [10], which is a stack processor with short 16-bits instructions. It is open source and fully implemented in VHDL language, which allows rapid prototyping in In the next section, we detail the major components of G-MPSoC. E-ISSN: 2224-266X CE15 M0 CE1 Network Connection M15 SCU (1,2) CE0 M1 CE15 CE1 M15 SCU (1,1) M0 SCU (2,0) M1 CE1 CE0 CE Connection CE0 M1 CE1 Memory Connection M0 CE0 M1 M15 To overcome the limits of the existing proposed massively parallel architectures on-chip, we define a Generic Massively Parallel SoC (G-MPSoC), allowing the execution of wide range of intensive signal processing applications. To meet the high performance requirements of these applications, this generic architecture can have different configurations going from simple homogeneous structure to clustered heterogeneous structure communicated through regular Network-on-Chip (NoC) and controlled by hierarchical master-slave control structure [14]. This design is implemented into FPGA platform, which allows rapid reconfiguration of the architecture according to the designer needs. This reconfigurability based on programmable logic elements facilitates the system scalability. M15 CE15 SCU (0,1) CE0 SCU (1,0) CE1 CE15 SCU (0,0) Global Control M1 CE1 CE15 MCU CE0 M1 CE1 MCU Memory M0 CE0 M1 457 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS FPGA with instructions set simple to expand. Some specific instructions are added to the FC16 instructions set, mainly: mains in idle state until the end of the parallel execution. It is a synchronising instruction, which depends on the end-execution signal value. • Mask instructions To activate the involved nodes in the parallel execution, the MCU executes mask instructions, as presented in table 1, and broadcasts the mask map and mask code to all SCUs. The mask is coded in 32 bits, to address an array of (16x16) nodes. The MCU is connected to the SCUs grid through a bidirectional bus to communicate the parallel control instructions from MCU to SCUs and the computation results from CEs to MCU via SCUs. The end signal connects all SCUs to inform the MCU of the end of the parallel execution. 3.1.2 SCUs array Table 1: Mask instructions selbf 0x0080 Mask Code (000) selbfand 0x0081 (001) selbfor 0x0082 (010) selbfxor 0x0083 (011) Definition The second execution and control level in G-MPSoC is presented by the SCUs array. Each SCU controls: the local node activity, the autonomous parallel execution in the CEs, the end execution and the synchronous communication in the interconnection network. As shown in fig. 2, it is composed of four modules, where each one of them independently performs a specific control function, as detailed in [14]. Activate SCUs selected by the mask. Activate SCUs in the intersection of the current mask and previous one. Activate SCUs in the union of the current mask and previous one. Activate SCUs in the union of the current mask and previous one except the intersection part. ortree end SCU_ORTREE ce_inst CMD(15:0) scu_inst SCU_ Activity data_rce ISCU(15:0) Local_Control scu_code go scu_data brdbfb 0x0085 (101) brdall 0x0086 (111) xnet_a SLCU_COM Figure 2: SCU design • SCU Activity module This module receives the mask activity and the parallel control instruction from the MCU. Then, according to the mask code and broadcast code, it sets the activity-flag and transfers the parallel control instruction to the Local Control module, respectively. We notice that the use of SCU Activity module in G-MPSoC architecture allows the sub-netting of the SCUs grid, which optimizes the data flow transfer and increases the parallel broadcast domains. Definition Broadcast parallel control instructions to active SCUs. Broadcast parallel control instructions to inactive SCUs. Broadcast parallel control instructions to all the SCUs. • Local Control module After sub-netting the network, the parallel control instructions are broadcast to a Local-Control • Wait end instruction When executing this instruction, the MCU re- E-ISSN: 2224-266X xnet_com SCU_ROUTER Network Interface 0x0084 COM A D SCU COM brdbf Mask Code (100) COM inst COM_Control Table 2: Mask instructions instruction OpCode A Dir • Broadcast instructions Coded in 32 bits, the broadcast instructions identify which area executes parallel control instructions (table 2). Once the mask is mapped into the SCUs grid and when the broadcast instruction is executed, the MCU fetches the parallel control instruction, and then broadcasts it to all selected SCUs followed by the broadcast code. r_w simd_scac E CE Interface MCU interface instruction OpCode 458 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS module. This module prepares the communication phase and controls the CEs execution. The main sub-module in Local-Control is the Instruction Decoder, which decodes the instructions received from the SCU Activity module. These instructions are coded in 32 bits, as detailed in [14]: the first 16 bits represent the control microinstruction (CMD) and the last 16 bits represent: the address of the parallel instructions block, the single parallel SIMD instruction (P INST) or the value of the communicated data. տ ↑ ր → ց ↓ ւ ← • SCU COM module The SCU components in G-MPSoC are connected in two-dimensional neighbourhood interconnection network via the SCU-COM module. It allows the SCU to communicate with its 8 neighbours using only 6 connections. The SCU-COM is composed of COM Control and SCU router sub-modules. The COM Control sub-module manages the data transfer according to the communication instructions. The SCU router sub-module itself is composed of two routers, as presented in the fig. 3. S Local E Table 3: X-net directions Direction Code R-SCUXnet North West 0 0 տ North 1 0 տ North East 2 1 ր East 3 1 ր South East 4 2 ց South 5 2 ց South West 6 3 ւ West 7 3 ւ R-Xnet 0 տ 1 ր 1 ր 2 ց 2 ց 3 ւ 3 ւ 0 տ Each SCU-Router can take 4 different directions, allowing the data transfer in 8 directions, as detailed in table 3. When the COM-Control decodes the communication instructions, it orders the COM-Router to open the selected ports of the couple (R-Xnet,RSCUXnet) to achieve the data transfer in the specified direction. A COM-Control handles the communication requests and transfers data from the local SCU to the neighbour SCU in a given direction. Once the communication is established, data will be stored in R COM register to be used by the requester CE. Direction DEMUX MUX • OR Tree module The barrier synchronization is a high latency operation in massively parallel systems. Several systems have implemented either dedicated barrier networks [12] or provided hardware support within existing data networks [13]. The ORTree is a mechanism of global OR, checking the state of the system parallel processing. It is composed of a tree of ”OR” gates, which compares the end execution signals of all the CEs in pairs. It is a barrier synchronization that allows the controllers to know if all activated CEs finished the computation. The G-MPSoC supports a hierarchical OR Tree structure. The first level is in the SCU component to test the end execution in cluster of CEs and the second one is in the SCUs grid to test the end execution in all nodes of the system. (a) (b) Figure 3: Architecture of the SCU-Router (a) R-SCUXnet - (b) R-Xnet All the communications take place in the same direction, so there is no messages congestion in the same data transfer port. This feature allows the simple design of the routing element: – R-SCUXnet manages directions and distances of communications. It is composed of a couple of 4:1 mux/demux, allowing synchronous data transfer. The modular structure of the SCU and the independent execution of the control functions allow the autonomous parallel computation while performing the parallel communication. – R-Xnet allows simultaneous connection of any pair of non-occupied ports according to the given direction. E-ISSN: 2224-266X 459 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS 3.1.3 X-net: neighbourhood interconnection network To maintain efficient execution in embedded systems and high performance through a selective broadcast, we propose an on-chip regular neighbourhood interconnection network inspired from the network used in MP-1 and MP-2 MasPar [22] machines: called Xnet Network-on-Chip. The X-net network uses various configurations according to the data-parallel algorithms needs. Therefore, we define different bus sizes (1 bit, 4 bits or 16 bits) and different network topologies 1D (linear and ring) and 2D (mesh and torus). To change from one configuration to another, the designer has to use specific parameters (topology and bus size values) to establish the appropriate connections. When the network parameters are selected, the X-net network is generated with the chosen configuration. To achieve the re-usability and reconfigurability, the X-net network directly connects each CE with its 8 nearest neighbours in bidirectional ports through the SCU component. In some cases, the extremity connections in X-net network are wrapped around to form a torus topology, which is used to facilitate the matrix computation algorithms. All SCUs have the same direction controls. In fact, each SCU can simultaneously send a data to the northern neighbour and can receive another data from its southern neighbour. The X-Net uses a bit-state signal to identify nodes that participate in communication. Inactive SCUs can be used as pipeline stages to achieve distant communication. This data transfer through networks occurs without conflicts and is achieved by COM S and COM R instructions that allow all the SCUs to communicate with their neighbours in a given direction at a given distance. cessor can be assigned to dedicated hardware modules, called IP accelerators. It allows executing a specific function more efficiently than with any processor. Regardless of the function to achieve, HW-CE is only sensitive to the trigger signal sent by the SCU controller. Once received, the HW-CE begins the execution, independently of the others CEs in the cluster. The external interface of the CE is the same whatever the nature of the component (HW-CE or SW-CE) to allow the rapid integration of the CE component in the G-MPSoC architecture. Each CE in the cluster is connected to its own data register R CEi located in SCU to store the intermediate results, needed for the communication process, and the final result that will be transmitted to the MCU. Each cluster of CEs is controlled by its SCU and only performs when receives the trigger signal. 3.1.5 Memories • Sequential memory: MCU-memory The MCU is connected to the program memory. It includes sequential instructions to be executed by MCU and parallel control instructions to be broadcast to SCUs. The data are stored in the stack of MCU processor. • Parallel distributed memories: Mi memory Each SW-CE is connected to its own local Mi memory. It is divided into an instructions memory, including parallel instructions blocks, and a data memory, including parallel data. Depending on the SW-CE nature, the designer chooses to include or not the data memory. For example, in the case where SW-CE is a stack processor, the use of data memory is not necessary. 3.1.4 Cluster of CEs The cluster of CEs is the third execution level in GMPSoC. It allows the parallel execution through either homogeneous CEs, using the same Software SW-CE, or heterogeneous CEs with different Software SW-CE and Hardware HW-CE. The G-MPSoC architecture is designed as parametric and reconfigurable system, which is able to target various applications through customized architecture. Indeed, the number of the nodes (SCUs, CEs and memory size) is parametric in G-MPSoC, allowing the easiest scalability of the system. All these nodes are connected via a reconfigurable interconnection network, which can have one topology among 4 (2D (mesh, torus) and 1D (linear, ring)), 1 direction among 8 (N, S, E, O, NE, NO, SE and SO) and 1 distance ranging between 1 and 16. To facilitate the modification and the reuse of the components that constitute the architecture, a generic implementation is defined, by which several G-MPSoC configurations can be built. To configure a G-MPSoC system from a top level design, the designer defines the entities of the modules that will be used in the system, • SW-CE: Processor Element (PE) Each PE executes its own instructions block independently from the others PEs. This autonomous execution requires the integration of a local instruction memory and a local Program Counter (PC) in each SW-CE. When the SCU orders the parallel computation, all active SWCEs start the execution of their local instructions blocks at the same address on different data. • HW-CE: IP Accelerator Some specific functions performed by the pro- E-ISSN: 2224-266X 460 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS specifies their parameters and interconnects them together. Indeed, G-MPSoC instance can range from a simple configuration composed of a MCU connected to a parametric grid of SCUs, where each SCU is connected to a single CE, to a complex configuration where SCUs are interconnected via a reconfigurable neighbourhood network and each SCU is connected to a cluster of heterogeneous CEs. Therefore, from a generic architecture a tailored solution can be built, according to the application requirements, in terms of resources: computation, memorization and communication. 3.2 network. The master-slave control mechanism provides a flexible parallel execution with the use of multiple control flows globally synchronized. This flexibility increases the system scalability. 4 The choice of the G-MPSoC implementation on FPGA is justified by the flexibility and the reconfigurability of this device. It allows the implementation of generic architecture that is effectively tailored to the application requirements. The number of nodes (SCU controllers + CEs calculators) and the local memories size are parametric. Indeed, a targeted application can need many SCUs with many heterogeneous CEs using short memories, or a small amount of SCUs and CEs with large memories. This architecture is implemented with VHDL language and targets the Xilinx Virtex6 ML605 FPGA [15]. To evaluate the GMPSoC performance, we use the ISim simulator and the ISE synthesis Xilinx tools. The G-MPSoC configuration includes a single MCU and a grid of nodes. Each node is a hierarchical unit that contains a SCU and its cluster of CEs. Each CE is another hierarchical unit that contains an IP-accelerator or a PE connected to its local memory. Another intermediate hierarchical level, connecting several SCUs, has been defined to facilitate the routing process. All processors, used in G-MPSoC architecture, have a pre-existent FPGA implementation (FC16 [10], HoMade [16]...). We add some signals to be adapted to the CE generic interface and some instructions like: the wait GO instruction, to wait for the trigger signal sent from the SCU, and the end CE instruction, to inform the SCU of the end of the parallel computation. MCU is a modified FC16 processor, where mask, broadcast and wait end instructions are added. Its interface with the SCUs grid is modified to support 32 bits buses and 3 bits mask/broadcast code bus. The IP-accelerator can be a predefined Xilinx IP or an implemented IP using the RTL description. The memories modules are implemented using the existing FPGA blocks memories. The connections between the components used in this design require several signals that consume a large area. To increase the system scalability, it is necessary to optimize the use of these signals without affecting the processing. We have proposed to break all final result connections from SCUs to MCU and we have only kept the closest signals to MCUs (i.e. the SCUs in the first column). For the others, shift operations are performed to transfer the final result to the SCUs in the first column. This method allows the decrease of the system area occupancy approximately Execution model of G-MPSoC 3.2.1 Synchronous Communication • One-to-all communication: broadcast with mask The broadcast with mask [11] is a technique of dividing a network into two or more subnetworks, where the execution of different instructions is autonomous. This mechanism starts by activating the SCUs involved in the execution according to the mask sent by the MCUs. Each SCU in the grid has a unique number composed of the reference couple (X,Y). This number represents its position relative to X line and Y column. The mask codes set the activity flag of each SCU. All active SCUs execute control instructions, while the others remain in an idle state. • Collective regular communication The communication in G-MPSoC is defined by the temporal and spatial regularity of its data transfer. It is regular, by which all the nodes have the same degree of connection with their neighbours and synchronously communicate in the same direction and distance. This regular communication is managed by Send and Receive instructions. It is synchronously performed from a single control flow and locally controlled by the slave controllers. Such communication is simple to design without the overhead of synchronization mechanism. 3.2.2 Asynchronous Computation The master-slave control mechanism [14] is based on two control levels: the first one (MCU) executes sequential instructions and sends parallel control instructions to the second control level: SCUs. The second level controls the parallel communication in the interconnection network and the parallel execution in the cluster of CEs. Each CE executes its instructions stream asynchronously from the others, while the SCUs manage the synchronous data transfer in the E-ISSN: 2224-266X G-MPSoC implementation 461 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS about 6.26%. 5 The Synthesis results predict the maximum frequency of a given configuration, ranged around 88.212MHz This frequency is relative to the used processors frequency (FC16 151.404MHz, HoMade 94.117MHz) and the longest critical path in the design. A good place&route of the designed components is necessary to reduce the length of this critical path and to accelerate the signal propagation. The generic feature of G-MPSoC is ensured not only by the modular and parametric structure of its different components, but also by its hierarchicaldistributed control, the diversity and the heterogeneity of its processing units, and the parametricity of its interconnection network. In this section and through several benchmarks, we validate the generic feature of the architecture and show the effectiveness of their implementation in terms of flexibility, scalability and high performance (execution time and bandwidth). To test these benchmarks, we designed the G-MPSoC architecture with VHDL language and targeted the Xilinx Virtex6-ML605 FPGA as prototyping platform. Table 4 gives the resources occupancy for the GMPSoC components on the targeted FPGA. Depending on the used CE size, G-MPSoC configuration can include 100 nodes with FC16 processor, 49 nodes with HoMade processor or more than 256 nodes with muladd accelerator. This number of component can be increased with the use of another FPGA generation with more provided hardware resources to reach the defined architecture with 256 nodes and 16 CEs per cluster (i.e. 4096 CEs). The table 4 shows the power consumption of these modules. For all of them, the power characteristics are reported as around 400mW. This low power consumption makes the choice of the integrated CE only depending on size and speed. For simple operation, it is better to integrate specialized accelerator IPs than processors, in order to raise the architecture size and to accelerate the parallel execution. Table 4 also shows that the control structure in G-MPSoC system does not clutter the occupied area. In particular, the SCU module contains four submodules that not require too much logic elements, as shown in table 5. Thereby, the total area-cost of GMPSoC system does not burst with the use of masterslave control structure. In fact, based on the work in [14], the experimental results show that for 100 nodes with clusters of a single CE (FC16 processor) connected to 4KB local memory, the grid of SCUs occupy about 38% of the total consumed on-chip logic area. For an array of 16 SCUs, it is around 16%. In fact, if the number of G-MPSoC nodes increases, the area occupancy linearly increases, but the incremental cost of adding SCU functionality to G-MPSoC control system quickly becomes small. 5.1 E-ISSN: 2224-266X LUTs 4 352 101 3 Benchmark 1: Red-Black checkerboard Master-slave control in G-MPSoC architecture is based on a hierarchical-distributed structure, which allows having several parallel processing areas selected by the sub-netting mechanism. Thereby, the MCU can manage these areas and can switch from one to another using different activity masks, during the execution of a parallel program. This flexibility to switch between the processing areas allows the simultaneous execution of both conditional structure blocks (if...then...else...). This feature facilitates the activity process and avoids having idle processing nodes when executing the conditional structure in massively parallel architecture. To highlight this feature and its impact on the proposed architecture performance, we tested the RedBlack checkerboard application with the broadcastwith-mask [11] and traditional one-to-all [19] methods. This application is used to solve partial differential equations (Laplace equation with Dirichlet boundary conditions, Poisson-Boltzmann Equation, etc), using massively parallel systems. It is based on the division of parallel processing nodes into red and black areas, where all the same colour area are performing the same instructions block. The code below is composed of a set of the added mask/broadcast instructions and a ”lit” FC16 instruction [10] to map the red-black mask into the processing grid, and then broadcast the parallel instruction to trigger the execution of the first instructions block. Using the inverted mask, the second area can be activated and the execution of the second instructions block can be triggered. The code of the broadcast-with-mask, detailed in listing 1, is implemented with only 6 instructions to subnet the (16 × 16) grid into red-black checkerboard as shown in fig. 4. Therefore, the nodes with the same colour have the same control flux and execute the same instructions blocks, independently of the others Table 5: SCU sub-modules: synthesis result on FPGA Virtex6 ML605 Sub-module Activity controller Parallel execution controller Communication controller End execution controller (ORTree) Experimental results Registers 2 80 50 0 462 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS Table 4: G-MPSoC components: synthesis result on FPGA Virtex 6 ML605 CE LUTs 1132 3560 26 1139 420 FC16 HoMade muladd accelerator MCU SCU Control module Slice registers 206 493 19 201 132 nodes in the neighbour sub-network. l i t 0xAAAAAAAA / / f i n d mask A ( 2 c y c l e s ) selbf / / s e n d mask A ( 1 c y c l e ) l i t 0 x55555555 / / f i n d mask B ( 2 c y c l e s ) selbfor / / s e n d mask B ( 1 c y c l e ) l i t 0 x06000010 / / find parallel control i n s t r u c t i o n (2 cycles ) brdbf / / send p a r a l l e l c o n t r o l i n s t r u c t i o n (1 cycle ) 0 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 0 1 0 Mask A Bandwidth (MB/s) 800 780 760 740 720 one-to-all model 700 broadcast-with-mask model 680 660 640 620 4 16 64 256 CEs Figure 5: Influence of broadcast models on bandwidth Fig. 5 shows that an increasing number of nodes integrated in the G-MPSoC architecture leads to a higher gap between the bandwidth provided by the broadcast-with-mask method and one-to-all method. This result shows the importance of broadcast-withmask technique in the massively parallel systems scalability, without causing bottlenecks. However, the traditional one-to-all method is recommended with small systems because of its efficiency in terms of area cost and bandwidth. Thereby, the designer has to make the right choice between the system size and the broadcast technique to ensure the high performance needed for the parallel execution. 1 Mask B Red/Black mask 5.2 Benchmark 2: Matrix multiplication The processing units (CEs) in the G-MPSoC architecture can have several forms. They can be homogenous with the same type of processors or heterogeneous with clusters of different processors or clusters of dif- Nodes grid Figure 4: Red-Black mask E-ISSN: 2224-266X Power(W) 0.462 0.509 0.436 0.483 0.536 820 0 0 Fmax(Mhz) 151.404 94.117 624.220 153.520 256.937 840 As shown in listing 1, the red-black mask can be rapidly mapped into a grid of (16 × 16) nodes in 6 clock cycles, unlike the one-to-all method [19], that needs several clock cycles to map this mask into the processing nodes. With the one-to-all method, there must be a relationship between identity and the activity bit to activate the nodes with peer identities and disable the ones with odd identities, or to enable nodes with identity lower than a specific value and disable the others. So that to map the Red-Black mask into a grid of (16 × 16) nodes, 16 operations are required: 8 for odds and 8 for peers, executed in 20 clock cycles. Thus, we notice that the broadcast-with-mask method allows the rapid grid sub-netting on several processing areas, twice shorter than with the traditional one-to-all method. 0 FPGA occupancy <1% 2% <1% <1% <1% The parallel instructions broadcast using these two different activity control methods requires different implementations of the control structure in GMPSoC. The synthesis result given by ISE tool [17] and presented in table 6 shows the large bandwidth (∼12% higher than the one-to-all) provided by the G-MPSoC using broadcast-with-mask despite the additional consumed logic elements. This can be explained by the fact that the proposed control structure for parallel activity and parallel broadcast management is based on local controllers (SCUs), which optimize the data flow transfer. Listing 1: Broadcast-with-mask code for Red-Black checkerboard 1 DSP48E1s 1 1 - 463 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS Table 6: Synthesis results on FPGA Virtex6 of G-MPSoC with different activity control methods Broadcast with mask Broadcast one-to-all LUTs 257747 54% 242768 51% FPGA occupancy Slice registers 410904% 410564% memories 102567% 102567% Performance Bandwidth(MB/s) 786 697 Fmax(Mhz) 205.999 182.846 Power(W) 0.421 0.447 G-MPSoC system with different directions and distances and using the Send/Receive instructions. This application is performed via the X-net network, using the neighbourhood/distant communications and the broadcast-with-mask method to facilitate the network sub-netting, as shown in fig. 6. ferent processors and accelerators. Each one performs a specific function or blocks of parallel instructions. To highlight the genericity and the flexibility of the CEs, we have tested the matrix multiplication example with two different G-MPSoC configurations with a grid of (8×8) nodes, where each node is composed of: • Conf.1: SCU + software CE (FC16 processor). • Conf.2: SCU + software CE (FC16 processor) + hardware CE (muladd accelerator to perform the multiplication and the addition operations). The matrix multiplication is one of the basic computational kernels in many data parallel applications. It is presented by the equation (1): cij = n X aik bkj avec 1 ≤ i, j ≤ n (1) k=1 Figure 6: Summing application steps To perform (8×8) matrix multiplication into system with (8×8) nodes, the execution needs 16 multiplications, 16 additions and 30 communications. The FC16 processor performs the multiplication in 19 clock cycles [10], whereas muladd accelerator performs the multiplication in one cycle. That is why the configuration based on clusters of (FC16 + muladd) is more efficient than the architecture based on only SW-CE (FC16). We notice that the conf.2 is the most suitable for matrix multiplication. The generic feature of G-MPSoC architectural model allows changing the system structure from a homogeneous configuration to a heterogeneous configuration according to the performance provided by the CEs and the algorithm requirements. The use of parametric modules significantly facilitates the generation of processing nodes with rapid modification of system configuration. Thereby, the G-MPSoC system is flexible, scalable and quickly adapts to applications changes. Performing the Summing application, in traditional SIMD system, needs several clock cycles, especially for node activity step and communication step. In addition, doing distant communication in [18] requires external link. Therefore, the use of the X-net network and the broadcast-with-mask mechanism improves the communication performance in G-MPSoC architecture, where X-net communication costs d (distance) cycles: the delay of data transfer between source and destination (1 cycle for neighbour communication). 200000 FPGA resources 175000 150000 125000 100000 75000 LUTs 50000 Registres 25000 0 4 16 64 256 4 mesh 16 64 torus 256 4 16 64 bus (16+1 bits) 5.3 4 16 64 256 torus bus (1bit) number of PEs (Topology/bus size) Benchmark 3: Parallel Summing To validate the flexible communication in G-MPSoC architecture, we chose the summing application [18] using a grid of 4×4 nodes. The aim of this application is to test the synchronous communication in E-ISSN: 2224-266X 256 mesh Figure 7: FPGA occupancy with different size of GMPSoC configurations 464 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS 600 5.4 550 450 Bandwidth (MB/s) Benchmark 4: FIR filter The digital Finite Impulse Response (FIR) [21] filters are widely used in digital signal processing to reduce certain undesired aspects. A FIR structure is described by the differential equation (2): 500 400 350 300 250 200 150 y(n) = 100 50 h(k) × x(n − k) (2) k=0 0 4 16 64 256 4 16 mesh 64 256 4 torus 16 64 256 4 mesh bus size (16+1) bits 16 64 256 It is a linear equation, where X represents the input signal and Y presents the output signal. The order of the filter is given by the parameter N, and H(k) represents the filter coefficients. The number of inputs and outputs data is equal to n. To emphasize the advantage of communication networks in the execution of data-parallel programs, we have implemented the FIR filter with two methods: 1st method: 2D configuration This method is inspired from the work presented in [20]. It is an implementation of FIR filter as defined in its equation (2). The system is composed of a MCU, a grid of (4×4) nodes (SCU + PE) and an X-net network in torus topology. We assume that the FIR system takes 16 H(k) parameters and 64 X(n) inputs. The algorithm is described as the following steps: torus bus size 1 bit CEs number Figure 8: Influence of CEs number and bus sizes on bandwidth Fig. 7 and fig. 8 present synthesis and bandwidth results on different system topologies and bus sizes. We note a compromise between area and bandwidth. Indeed, the configuration integrating the X-net interconnect with (16+1) bits buses gives efficient data transfer with large bandwidth, but it occupies a large chip area (2 times higher), as shown in fig. 7. In addition, if the number of the nodes is multiplied by a factor of 16, the bandwidth decreases by factor of 2 and the FPGA area increases by a factor of 4, which is an acceptable rate. 1. Data initialization: each H(k) is stored in PE(i,j) local data memory according to the following function: k = 4 × i + j and all X(n) values are stored in each PE memory. 300 250 Latency (cycle) N −1 X 200 2. Multiplication of all the inputs with H(k) in each PE(i,j). bus(16+1)bits 150 bus(4+1)bits 100 bus 1 bit 3. Communication: as shown in fig. 10, all PEs do West communication to send first multiplication element to their neighbour; then only PEs in the last column do North-communication. 50 0 0 2 6 distance 10 14 4. Addition of this new value with the second local multiplication value in each PE(i,j). Figure 9: Influence of buses size on communication delay 5. Repeat 4) until obtain all the results which will be stored in PE(0,0) memory. We have also tested the communication delay with the use of the different bus sizes. Fig. 9 shows that the communication time is 17 times higher with 1-bit bus than with 17-bits bus. Despite this tedious communication, 1-bit data transfer allows the use of relatively simple buses with low hardware cost, which increase system scalability. These previous experimental results show the efficiency of the integration of reconfigurable interconnection network in G-MPSoC. Depending on the application needs, the designer can select the network parameters, which guarantee the less communication delay. E-ISSN: 2224-266X To perform this FIR filter algorithm, (64×2) communications are required. The multiplication and the addition operations are performed in parallel. 2nd method: 1D configuration This method is inspired from the work presented in [23]. In this implementation, the system is composed of a MCU, 16 slaves and X-net network in linear topology. We also assume that the FIR system takes 16 H(k) parameters and 64 X(n) inputs. All H(k) are stored in MCU and sent to PE(i) when 465 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS PE 00 PE 01 PE 02 PE 03 PE 10 PE 11 PE 12 PE 13 PE 20 PE20 PE 21 PE 22 PE 23 PE 30 PE30 PE 31 PE31 PE 32 PE32 PE 33 PE33 Depending on the application needs, the designer can select the most appropriate network configuration to his system. The different topologies that can support the X-net offer diverse choice of 1D or 2D configurations as well as at the size of the interconnect bus, to ensure data transfer rapidly and with low cost. 6 This paper presents a new generation of massively parallel System-on-Chip based on generic structure, called G-MPSoC platform. It is a configurable system, composed of clusters of hardware and/or software CEs, locally controlled by a grid of SCUs and globally orchestrated by the MCU. All the CEs can communicate between each other via the SCUs components that are connected through regular X-net interconnection network. G-MPSoC architecture is entirely described in VHDL Language, in order to allow rapid prototyping and testing, using the synthesis and simulation tools. The execution model of GMPSoC is detailed to highlight the advantages of the synchronous communication, based on broadcast with mask structure and regular communication network, and asynchronous computation, based on master-slave control structure. An FPGA Hardware implementation of G-MPSoC platform is also presented in this paper and validated through several parallel applications. Different configurations are tested going from a simple homogeneous structure to a complex heterogeneous structure. All configurations tested different X-net network topologies, different data buses sizes and different memories sizes. In this work, we define a generic massively parallel system on chip that is able to be quickly adapted to the applications requirements. The next work is to define a new weakly coupled massively parallel execution model for G-MPSoC platform, based on Synchronous Communication and Asynchronous Computation. This model is classified between the synchronous centralized SIMD model and the asynchronous decentralized MIMD model. It takes advantages of these two models to allow more performance needed for the execution of nowadays intensive processing applications. Figure 10: FIR filter implementation in 2Dconfiguration Table 7: 16-order FIR with inputs implementation the inputs (n) Com execution time (cycle) on SIMD mode 8 16 64 G-MPSoC Method 1 G-MPSoC Method 2 ESCA 126 270 1134 72 144 576 255 351 1158 reconfSIMD on-chip 332 744 - they are needed in the algorithm. The input X(n) are shifted inter-slaves using West-communication, as presented in fig. 11. The n outputs are calculated using multiplication and addition instructions, in the alternative way. To do this algorithm, only 63 communications are required but the algorithm is not totally parallel. PE013 PE014 PE015 MCU PE00 PE01 PE02 PE03 PE012 0 0 0 0 0 0 0 x(0) h(15) 0 0 0 0 0 0 x(0) x(1) h(14) 0 0 0 0 0 x(0) x(1) x(2) h(13) 0 .. 0 .. 0 .. 0 .. x(0) .. x(1) .. x(2) .. x(3) .. h(12) .. x(0) .. x(1) .. x(2) .. x(3) .. x(12) .. x(13) .. x(14) .. x(15) .. h(3) x(47) x(48) x(49) x(50) x(59) x(60) x(61) x(62) h(1) h(2) h(0) Figure 11: FIR filter implementation in 1Dconfiguration Experimental results The experimental results in table 7 show the time needed for data transfer in FIR filter application. As expected, the G-MPSoC architecture allows more rapid processing than both reconfigurable SIMD architecture on-Chip [23] and ESCA [20] architectures. We deduce that G-MPSoC architecture based on linear interconnection topology is the most effective for the FIR application. E-ISSN: 2224-266X Conclusion References: [1] D. Melpignano, L. Benini, , E. Flamand, et al. Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications. Proc. Int. Conf. Design Automation Conference, New York, USA, 2012, pp. 1137–1142. 466 Volume 14, 2015 Hana Krichene, Mouna Baklouti, Mohamed Abid Philippe Marquet, Jean-Luc Dekeyser WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS [2] SIMD < SIMT < SMT: parallelism in NVIDIA GPUs. http://yosefk.com/blog/simd-simt-smtparallelism-in-nvidia-gpus.html [3] PU-based Image Analysis on Mobile Devices. http://arxiv.org/pdf/1112.3110v1.pdf. [4] D.B. Thomas, L. Howes and W. Luk. A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation. Proc. Int. Conf. Field Programmable Gate Arrays, New York, USA, 2009, pp. 63–72. [5] F. Hannig, V. Lari, S. Boppu, et al. Invasive Tightly-Coupled Processor Arrays: A DomainSpecific Architecture/Compiler Co-Design Approach, ACM transactions on Embedded Systems, 13, 2014, pp. 133:1-133:29. [6] F. Conti, C. Pilkington, A. Marongiu, et al. HeP2012: Architectural Heterogeneity Exploration on a Scalable Many-Core Platform. Proc. Int. Conf. Application-specific Systems, Architectures and Processors, Zurich, Swiss, 2014, pp. 114-120. [7] Many-core Kalray MPPA, http://www.kalray.eu [8] The HyperCore Processor, http://www.plurality.com/hypercore.html [9] Next Generation CUDA Compute Architecture: Fermi WhitePaper, http://i.dell.com/sites/doccontent/sharedcontent/data-sheets/en/Documents/NVIDIAFermi-Compute-Architecture-Whitepaperen.pdf [10] R.E. Haskell and D.M. Hanna, A VHDL Forth Core for FPGAs, Journal of Microprocessors and Microsystems, 29, 2009, pp. 115-125. [11] H. Krichene, M. Baklouti, Ph. Marquet, et al. Broadcast with mask on Massively Parallel Processing on a Chip. Proc. Int. Conf. High Performance Computing and Simulation, Madrid, Spain, 2012, pp. 275–280. [12] C.E. Leiserson, Z.S. Abuhamdeh, D.C. Douglas, et al. The network architecture of the connection machine CM-5, Journal of Parallel and Distributed Computing, 33, 1996, pp. 145-158. [13] S.L. Scott. Synchronization and communication in the T3E multiprocessor. Proc. Int. Conf. Architectural Support for Programming Languages and Operating Systems, New York, USA, 1996, pp. 26–36. [14] H. Krichene, M. Baklouti, Ph. Marquet, et al. Master-Slave Control structure for massively parallel System-on-Chip. Proc. Int. Conf. Euromicro Conference on Digital System Design, Santander, Spain, 2013, pp. 917–924. E-ISSN: 2224-266X [15] ML605 Hardware User Guide - UG534 (v1.8), http://www.xilinx.com/support/documentation/ boards and kits/ug534.pdf. [16] HoMade processor, https://sites.google.com/site/homadeguide/home. [17] Xilinx website, https://www.xilinx.com. [18] M. Leclercq and P.Y. Aquilanti, X-Net network for MPPSoC, Master thesis, University of Lille 1, 2006. [19] M. Baklouti, A rapid design method of a massively parallel System on Chip: from modeling to FPGA implementation. PhD thesis, University of Lille 1 & University of Sfax, 2010. [20] P. Chen, K. Dai, D. Wu, et al. Parallel Algorithms for FIR Computation Mapped to ESCA Architecture. Proc. Int. Conf. Information Engineering, Beidaihe, China, 2010, pp. 123–126. [21] S.M. Kuo and W.S. Gan, Digital Signal Processors Architectures, implementation and application. Prentice Hall, 2005. [22] J.R. Nickolls, The design of the MasPar MP1: a cost effective massively parallel computer, IEEE Computer Society International Conference, Compcon Spring, San Francisco, USA, 1990, pp. 25–28. [23] J. Andersson, M. Mohlin and A. Nilsson, A reconfigurable SIMD architecture on-chip. Master Thesis in Computer System Engineering - Technical Report, School of Information Science, Computer and Electrical Engineering - Halmstad University, 2006. [24] Sh. Raghav, A. Marongiu and D. Atienza, GPU Acceleration for Simulating Massively Parallel Many-Core Platforms. Journal of Parallel and Distributed Systems, 26, 2015, pp. 1336–1349. [25] D. Walsh and P. Dudek , An Event-Driven Massively Parallel Fine-Grained Processor Array. Proc. Int. Conf. Circuits and Systems (ISCAS), Lisbon, Portugal, 2015, pp. 1346–13496. 467 Volume 14, 2015

RELATED PAPERS

RELATED TOPICS

Log In

G-MPSoC: Generic Massively Parallel Architecture on FPGA

G-MPSoC: Generic Massively Parallel Architecture on FPGA

Related Papers

RELATED PAPERS

RELATED TOPICS