An OpenMP Runtime for Transparent Work Sharing Across Cache-Incoherent Heterogeneous Nodes

Authors:

Christopher J. Rossbach,

Binoy Ravindran

Middleware '20: Proceedings of the 21st International Middleware Conference

Pages 415 - 429

https://doi.org/10.1145/3423211.3425679

Published: 11 December 2020

Abstract

In this work, we present libHetMP, an OpenMP runtime for automatically and transparently distributing parallel computation across heterogeneous nodes. libHetMP targets platforms comprising CPUs with different instruction set architectures (ISAs) coupled by a high-speed memory interconnect, where cross-ISA binary incompatibility and non-coherent caches require that application data be marshaled before it can be shared across CPUs. Because of this, work distribution decisions must take into account both the relative compute performance of asymmetric CPUs and communication overheads. libHetMP drives workload distribution decisions without programmer intervention by measuring performance characteristics during cross-node execution. A novel HetProbe loop iteration scheduler decides whether cross-node execution is beneficial, and either distributes work according to the relative performance of CPUs when it is, or places all work on the set of homogeneous CPUs providing the best performance when it is not. We evaluate libHetMP using compute kernels from several OpenMP benchmark suites and show a geometric mean 41% speedup in execution time across asymmetric CPUs.
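The scheduling decision described in the abstract can be illustrated with a minimal sketch. This is not libHetMP's implementation: the `NodeProbe` type, the `plan_iterations` function, and the two rate fields are hypothetical names, and the rule used here (compare the combined cross-node rate with the best single-node rate) is an assumed stand-in for whatever metric the HetProbe scheduler actually measures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProbe:
    """Hypothetical probe measurements for one node."""
    name: str
    cross_rate: float  # iterations/s observed while all nodes ran together
    solo_rate: float   # iterations/s assumed for this node running alone


def plan_iterations(probes, total):
    """Return {node name: iteration count} for `total` loop iterations."""
    best_solo = max(probes, key=lambda p: p.solo_rate)
    combined = sum(p.cross_rate for p in probes)

    # Assumed rule: cross-node execution pays off only if the combined
    # rate, which already includes communication overhead, beats the
    # best node running alone.
    if combined <= best_solo.solo_rate:
        return {p.name: (total if p is best_solo else 0) for p in probes}

    # Otherwise split in proportion to the measured cross-node rates.
    shares = [total * p.cross_rate / combined for p in probes]
    counts = [int(s) for s in shares]

    # Hand out iterations lost to rounding, largest fraction first.
    by_fraction = sorted(range(len(probes)),
                         key=lambda i: shares[i] - counts[i],
                         reverse=True)
    for i in by_fraction[: total - sum(counts)]:
        counts[i] += 1
    return {p.name: c for p, c in zip(probes, counts)}
```

With illustrative rates of 300 and 100 iterations/s for two nodes during cross-node execution, and a best solo rate of 320, the sketch splits 1000 iterations as 750 and 250. If communication overhead drops the first node's cross-node rate to 150, the combined rate of 250 falls below 320 and all 1000 iterations go to the faster node.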

Cited By

Bilbao, C., Saez, J., and Prieto-Matias, M. Rapid Development of OS Support with PMCSched for Scheduling on Asymmetric Multicore Systems. In Euro-Par 2022: Parallel Processing Workshops (2023), pp. 184--196. Online publication date: 2-May-2023. https://doi.org/10.1007/978-3-031-31209-0_14

Lyerly, R., Bilbao, C., Min, C., Rossbach, C., and Ravindran, B. An OpenMP Runtime for Transparent Work Sharing across Cache-Incoherent Heterogeneous Nodes. ACM Transactions on Computer Systems 39, 1-4 (2022), 1--30. Online publication date: 5-Jul-2022. https://dl.acm.org/doi/10.1145/3505224

Published In

Middleware '20: Proceedings of the 21st International Middleware Conference

December 2020

455 pages

ISBN:9781450381536

DOI:10.1145/3423211

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

NAVSEA/NEEC
Office of Naval Research

Conference

Middleware '20

Sponsor:

ACM

Middleware '20: 21st International Middleware Conference

December 7 - 11, 2020

Delft, Netherlands

Acceptance Rates

Overall Acceptance Rate 203 of 948 submissions, 21%
