research-article
Open access

Performance Portability Across Heterogeneous SoCs Using a Generalized Library-Based Approach

Published: 01 June 2014

Abstract

Because of tight power and energy constraints, industry is progressively shifting toward heterogeneous system-on-chip (SoC) architectures that combine general-purpose cores with a number of accelerators. However, such SoC architectures can be very challenging for the vast majority of programmers to program efficiently, because of the numerous programming approaches and languages involved. Libraries, on the other hand, provide a simple way for programmers to take advantage of complex architectures without having to learn new accelerator-specific or domain-specific languages. Increasingly, library-based (also called algorithm-centric) programming approaches propose to generalize the usage of libraries and to compose programs around them, instead of using libraries as mere complements.
In this article, we present a software framework for achieving performance portability by leveraging a generalized library-based approach. Inspired by the notion of a component, as employed in software engineering and HW/SW codesign, we advocate that nonexpert programmers write simple wrapper code around existing libraries to provide simple but necessary semantic information to the runtime. To achieve performance portability, the runtime employs machine learning (simulated annealing) to select the most appropriate accelerator and its parameters for a given algorithm. This selection factors in the possibly complex composition of algorithms used in the application, the communication among the various accelerators, and the tradeoff between different objectives (i.e., accuracy, performance, and energy).
Using a set of benchmarks run on a real heterogeneous SoC composed of a multicore processor and a GPU, we show that the runtime overhead is fairly small: 5.1% for the GPU and 6.4% for the multicore processor. We then apply our accelerator selection approach to a simulated SoC platform containing multiple inexact accelerators. We show that accelerator selection together with hardware parameter tuning achieves an average energy reduction of 46.2% and a speedup of 2.1× while meeting the desired application error target.
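
The accelerator selection described in the abstract can be illustrated with a minimal simulated-annealing sketch. This is not the authors' runtime. The profile table, the pipeline stages, the transfer cost, the additive error model, the cost weights, and every name below are assumptions made up for illustration, and hardware parameter tuning is left out for brevity.

```python
import math
import random

# Hypothetical profiles: (algorithm, accelerator) -> (time_ms, energy_mj, error).
# All numbers are invented for illustration only.
PROFILES = {
    ("fft", "cpu"): (10.0, 8.0, 0.00),
    ("fft", "gpu"): (4.0, 6.0, 0.00),
    ("fft", "approx"): (3.0, 2.0, 0.04),
    ("filter", "cpu"): (6.0, 5.0, 0.00),
    ("filter", "gpu"): (3.0, 4.5, 0.00),
    ("filter", "approx"): (2.0, 1.5, 0.03),
    ("encode", "cpu"): (8.0, 7.0, 0.00),
    ("encode", "gpu"): (5.0, 6.5, 0.00),
    ("encode", "approx"): (4.0, 2.5, 0.05),
}
PIPELINE = ["fft", "filter", "encode"]   # assumed composition of algorithms
ACCELERATORS = ["cpu", "gpu", "approx"]  # "approx" stands for an inexact accelerator
TRANSFER_MS = 1.0      # assumed cost when consecutive stages change device
ERROR_TARGET = 0.075   # assumed application-level error budget


def evaluate(mapping):
    """Return (time, energy, error) for one accelerator choice per stage."""
    time = energy = error = 0.0
    for i, (alg, acc) in enumerate(zip(PIPELINE, mapping)):
        t, e, err = PROFILES[(alg, acc)]
        time += t
        energy += e
        error += err  # simplification: errors are assumed to add up
        if i > 0 and mapping[i - 1] != acc:
            time += TRANSFER_MS  # communication between accelerators
    return time, energy, error


def cost(mapping, w_time=0.5, w_energy=0.5):
    """Weighted time/energy objective with a large penalty above the error target."""
    time, energy, error = evaluate(mapping)
    penalty = 10000.0 * max(0.0, error - ERROR_TARGET)
    return w_time * time + w_energy * energy + penalty


def anneal(steps=2000, t0=5.0, cooling=0.995, seed=0):
    """Simulated annealing over stage-to-accelerator mappings."""
    rng = random.Random(seed)
    current = ["cpu"] * len(PIPELINE)  # start from an exact, all-CPU mapping
    best = list(current)
    temp = t0
    for _ in range(steps):
        # Neighbor move: reassign one randomly chosen stage.
        candidate = list(current)
        candidate[rng.randrange(len(PIPELINE))] = rng.choice(ACCELERATORS)
        delta = cost(candidate) - cost(current)
        # Metropolis criterion: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current) < cost(best):
                best = list(current)
        temp = max(temp * cooling, 1e-3)  # geometric cooling with a floor
    return best


if __name__ == "__main__":
    mapping = anneal()
    print(dict(zip(PIPELINE, mapping)), evaluate(mapping))
```

Folding the error target into the cost as a large penalty keeps the search unconstrained while making any mapping that exceeds the budget more expensive than the all-CPU starting point. A real runtime would presumably replace the static table with measured or predicted profiles.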

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 11, Issue 2
    June 2014
    210 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2639036

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 June 2014
    Accepted: 01 January 2014
    Revised: 01 December 2013
    Received: 01 June 2013
    Published in TACO Volume 11, Issue 2

    Author Tags

    1. SoC
    2. approximate computing
    3. heterogeneity
    4. library-based programming
    5. performance portability

    Qualifiers

    • Research-article
    • Research
    • Refereed
