DOI: 10.1145/3423211.3425679

An OpenMP Runtime for Transparent Work Sharing Across Cache-Incoherent Heterogeneous Nodes

Published: 11 December 2020

    Abstract

    In this work we present libHetMP, an OpenMP runtime that automatically and transparently distributes parallel computation across heterogeneous nodes. libHetMP targets platforms comprising CPUs with different instruction set architectures (ISAs) coupled by a high-speed memory interconnect, where cross-ISA binary incompatibility and non-coherent caches require that application data be marshaled before it can be shared across CPUs. Work distribution decisions must therefore account for both the relative compute performance of the asymmetric CPUs and communication overheads. libHetMP drives these decisions without programmer intervention by measuring performance characteristics during cross-node execution. A novel HetProbe loop iteration scheduler decides whether cross-node execution is beneficial: when it is, it distributes work according to the relative performance of the CPUs; when it is not, it places all work on the set of homogeneous CPUs that provides the best performance. We evaluate libHetMP using compute kernels from several OpenMP benchmark suites and show a geometric mean 41% speedup in execution time across asymmetric CPUs.
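
The two-way decision the abstract attributes to HetProbe can be illustrated with a small sketch. This is not libHetMP's implementation: the abstract does not say how "beneficial" is determined, so the criterion below (estimated cross-node time, including a measured communication overhead, compared against the fastest single node) is an assumption, and the names `NodeProbe`, `hetprobe_split`, and `comm_overhead_sec` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeProbe:
    """Hypothetical record of what a probe period measured on one node."""
    name: str
    iters_per_sec: float  # loop iterations completed per second


def hetprobe_split(total_iters, probes, comm_overhead_sec):
    """Return {node name: iteration count} for one work-sharing loop.

    Assumed criterion (not taken from the paper): run across nodes only
    if the estimated cross-node time, including communication overhead,
    beats the estimated time on the fastest node alone.
    """
    if not probes or any(p.iters_per_sec <= 0 for p in probes):
        raise ValueError("need at least one node with positive throughput")

    best = max(probes, key=lambda p: p.iters_per_sec)
    combined_rate = sum(p.iters_per_sec for p in probes)

    single_time = total_iters / best.iters_per_sec
    cross_time = total_iters / combined_rate + comm_overhead_sec

    if cross_time >= single_time:
        # Cross-node execution is not beneficial: all work goes to the
        # best-performing node.
        return {p.name: (total_iters if p is best else 0) for p in probes}

    # Cross-node execution is beneficial: split iterations in proportion
    # to each node's measured throughput.
    shares = {
        p.name: int(total_iters * p.iters_per_sec / combined_rate)
        for p in probes
    }
    # Give any rounding remainder to the fastest node.
    shares[best.name] += total_iters - sum(shares.values())
    return shares


if __name__ == "__main__":
    nodes = [NodeProbe("x86", 300.0), NodeProbe("arm", 100.0)]
    print(hetprobe_split(1000, nodes, comm_overhead_sec=0.1))
    print(hetprobe_split(1000, nodes, comm_overhead_sec=5.0))
```

With a low overhead the loop is split 3:1 in favor of the faster node; with a high overhead every iteration stays on the faster node, mirroring the two outcomes described in the abstract.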

    Cited By

    • (2023) Rapid Development of OS Support with PMCSched for Scheduling on Asymmetric Multicore Systems. Euro-Par 2022: Parallel Processing Workshops, pp. 184-196. DOI: 10.1007/978-3-031-31209-0_14. Online publication date: 2-May-2023
    • (2022) An OpenMP Runtime for Transparent Work Sharing across Cache-Incoherent Heterogeneous Nodes. ACM Transactions on Computer Systems 39:1-4, pp. 1-30. DOI: 10.1145/3505224. Online publication date: 5-Jul-2022

    Published In

    Middleware '20: Proceedings of the 21st International Middleware Conference
    December 2020
    455 pages
    ISBN:9781450381536
    DOI:10.1145/3423211
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. OpenMP
    2. heterogeneous-ISA CPUs
    3. work sharing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    Middleware '20
    Middleware '20: 21st International Middleware Conference
    December 7 - 11, 2020
    Delft, Netherlands

    Acceptance Rates

    Overall Acceptance Rate 203 of 948 submissions, 21%
