Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Software Implementation vs. Hardware Implementation: The Avionic Test System Case-Study George Afonso, Rabie Ben Atitallah, Jean-Luc Dekeyser To cite this version: George Afonso, Rabie Ben Atitallah, Jean-Luc Dekeyser. Software Implementation vs. Hardware Implementation: The Avionic Test System Case-Study. ASPLOS, Mar 2012, London, United Kingdom. ฀hal-00665162฀ HAL Id: hal-00665162 https://hal.inria.fr/hal-00665162 Submitted on 1 Feb 2012 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. Software Implementation vs. Hardware Implementation: The Avionic Test System Case-Study George AFONSO Rabie Ben Atitallah Jean-Luc Dekeyser EADS-INRIA Lille Nord Europe LAMIH, University of Valenciennes LIFL, USTL, INRIA Lille-Nord Europe george.afonso@inria.fr rabie.benatitallah@univvalenciennes.fr ABSTRACT T2 This paper presents a development methodology that helps designers to map efficiently applications onto a heterogeneous CPU/FPGA system. An industrial case study is presented and aims at meeting performance and real-time requirements with the help of our architecture capabilities and avionic model parallelization. Different avionic model implementations will be presented in order to explain how to find the best trade-off between performance and design-time. 1. INTRODUCTION In order to meet real-time requirements, power, and flexibility goals, combination of general CPU and reconfigurable fabrics like Field-Programmable Gate Aarrays (FPGAs) is a promising solution leading to heterogeneous computing. In such systems, multicore CPU provides high computation rates while the reconfigurable logic offers high performance per watt and adaptability to the application constraints. Due to the high parallelism rate of the application, FPGA technology could offer better performances comparing to CPUs or GPUs up to 10x [1] at lower frequencies. Designers could exploit the existing partitioning of the architecture which leads to several possible implementations with different performances. The main focus of this paper is the task mapping taking into account the different constraints of our application 2. DESIGN METHODOLOGY CPU/FPGA ARCHITECTURE FOR In order to perform a complete Test and Simulation session, we need to simulate each embedded part of the helicopter (i.e. automatic pilot system, navigation, etc.), the environmental parameters (weather conditions, geographical factors, etc.), and the behaviour of the aircraft. This simulation models will be implemented in a software and/or hardware fashion in order to satisfy the timing constraints. In order to get the best performance, sequential model profiling must be performed in order to exploit the parallelism level inherent to our functions. Thus, we will determine the best trade-off between parallel software, hardware or software/hardware execution. . jean-luc.dekeyser@lifl.fr T5 T8 T4 System ressources T9 T1 T6 T3 T1 Parallelism extraction Application contraints VfEmbedded Hardware/Software mapping Heuristic Hardware code generation Software tasks Communication synthesis Multi-Core CPU C, C++, ... T7 FAST LINK ROCCC Hardware tasks FPGA Figure 1: Design methodology for efficient test mapping To overcome this challenge, we present a design methodology that covers the different development steps from software specification to the system implementation as shown in Fig. 1. First, we are considering a software application presented as a task graph containing different communicating functions (T0, T1, etc.). All applications are not adequate to be implemented onto heterogeneous CPU/FPGA architectures; a complete analysis of the source code is needed to verify what implementation could bring better performances. In order to leverage the parallelism of multi-core CPU/FPGA architecture, tools such as Vector Fabrics VfEmbedded [2] can find all data dependencies by analysing the C or C++ source code and extract the parallelism intrinsic in the application. vfEmbedded analyses partitions and maps applications on specific platforms from single processors to heterogeneous ones. It can also estimate the performance of the parallelized software before implementing it. Moreover, it can trim any overhead in your hardware to reduce cost and ensure that all critical behaviours in your program are exercised. The mapping step requires a heuristic method that takes as inputs system resources, application constraints, and the results of the analysis step. It will find the best configurations which satisfy timing requirements and resources utilization. After the mapping step, we need to develop some user hardware applications from the existing functions. To make this step more efficient, tools such as Riverside Optimizing Compiler for Configurable Computing (ROCCC) [3] can focus on FPGA-based code acceleration from a subset of the C language. ROCCC does not focus on the generation of arbitrary hardware circuits. Its objectives are to maximize parallelism within the constraints of the tar- get device, optimize clock cycle time by efficient pipelining, and minimize the used area. Finally, a compilation step using GCC and ISE from Xilinx, can be easily performed in order to map functions respectively onto the CPU and the FPGA parts. to 14 µs respectively with an area utilization varying from 5,800 to 14,300 slices. Indeed, more we parallelize the model F main loop, the occupied hardware increases and the execution time is reduced. The software implementation on the host offers an execu- 3. EXPERIMENTAL RESULTS ,$#$$% In this section, we will analyse different avionic models in order to obtain different possible implementations required to tune the design according to the needed performance and the real-time constraints. These implementations are also useful to switch between software and hardware configurations or vice-versa. First, as our objective is to analyse our software models in order to get better performance in our heterogeneous multi-core CPU/FPGA architecture, we will profile six different software models. To do so, we used VfEmbedded [2] tool presented previously which offers a "parallelize" function analyses all loops and parallelizes them, correctly handling the dependencies. !+#$$% -./01234%25/%6789% -./01234%25/%6789% !*#$$% !)#$$% !(#$$% !'#$$% !&#$$% '# *, % (# "$ % (# )( % )# ,' % )# )! % *# ,$ % *# (& % +# !' % +# (! !$ % #! $ !$ % #' ' !! % #$ & !! % #' ! !, % #+ & !" % #& ! !" % #* + !& % #" '% !"#$$% :15;/<%3=%8>?0/8%6@9% Results Model A Model B Model C Model D Model E Model F Table 1: Avionic models analysis Speed-up Max. of use- Synchronization ful threads overhead 1.2 2 1% 0 1 0% 2.4 3 0% 3.5 6 39% 3 4 29% 486.7 10000 95% Table I summarizes the experimental results obtained by analysing software avionic models on VfEmbedded with x86 Intel i5 processor execution support. First, we measure the speedup obtained after optimization. Secondly, threads must be created for parallel implementation strategy. This might be implemented through the use of POSIX calls creating the threads. The maximum useful number of threads is directly linked to the parallelism degree of the application. But threads means synchronisation and more frequent synchronization of small amounts of data means less delay while waiting for data and therefore less latency. Synchronization requires overhead and this could require lot of time and make parallelization step fail. VfEmbedded shows a 1.2 speed-up for the model A with low synchronization overhead with only two threads. Model B cannot make profit from a parallelization strategy. These two first models are more suitable to be implemented on a single core architecture. Model C, D and E offer higher parallelism degree with low synchronisation overhead for model C. These models are suitable for multi-core architecture implementation or hybrid multicore CPU/FPGA implementation by splitting the models in different functions and implement them in different calculation nodes. This, is also possible because of the low synchronization overhead. Finally, Model F shows a high parallelism degree with high synchronisation overhead, this model is very adapted to a hardware implementation, in the second part of our results we will generate VHDL code using ROCCC compilation framework. Secondly, using our previous results, we decide to implement model F in a hardware fashion using a C to VHDL translator. The model F main loop fits perfectly to such transformation. With the help of the ROCCC tool, a compilation step from C to VHDL is performed. Furthermore, we make profit from the loop unrolling feature provided by ROCCC in order to obtain varied implementations for the model F main loop. We synthesised 17 samples with varying the loop unrolling value. Implementation results are reported in the Fig. 2. We get an execution time varying from 19 Figure 2: Different implementations for model F tion time equals to 18 µs. The selection of the best implementation will depend on the global system constraints, the data mapping, and the available resources. ROCCC shows some limitations on more complex models, in this case manual coding becomes necessary. Using our previous results, we decide to implement model A in order to observe the behaviour of such model in a pure VHDL hardware implementation. The pure software version, we obtained a 2 µs execution time with a Quad-core processor (4x2.5 GHz). Despite VfEmbedded does not show a large parallelism degree for this model, most of the mathematical operations do not have data dependency between them. A pure VHDL implementation offers a 2 µs execution time with 8% space occupation (Xilinx ML605 Board) which is related to the pure software execution time. This result offers the opportunity to move model A from a processor to the FPGA in order to bring performance and respect the needed real-time constraints at the price of the hardware design time. 4. CONCLUSION In this paper, we have highlighted the benefits of some profiling tools making Test & Simulation model optimisation easier on the CPU/FPGA system. In future works, our investigation will concern a run-time task mapping for dynamically reconfigurable system which will map optimally highly communicating hardware and software tasks and deals efficiently with the violation of timing constraints. 5. REFERENCES [1] S. Asano, T. Maruyama, and Y. Yamaguchi, “Performance Comparison of FPGA, GPU AND CPU in Image Processing,” in 19th IEEE International Conference on Field Programmable Logic and Applications, FPL, Prague, Czech Republic, Aug. 2009. [2] Minjang Kim and Hyesoon Kim and Chi-Keung Luk, “A scalable approach to dynamic data-depedence profiling,” in 43rd Symposium on Microarchitecture (MICRO), December 2010. [3] N. W. Villarreal Jason, Park Adrian and H. Robert, “Designing Modular Hardware Accelerators in C with ROCCC 2.0,” in 18th Symposium on Field-Programmable Custom Computing Machines (FCCM’10), Washington, USA, May 2010.