
Mapping parallelism to multi-cores: a machine learning based approach

Published: 14 February 2009

Abstract

The efficient mapping of program parallelism to multi-core processors is highly dependent on the underlying architecture. This paper proposes a portable and automatic compiler-based approach to mapping such parallelism using machine learning. It develops two predictors, a data-sensitive and a data-insensitive predictor, to select the best mapping for parallel programs. They predict the number of threads and the scheduling policy for any given program using a model learnt off-line. By using low-cost profiling runs, they predict the mapping for a new, unseen program across multiple input data sets. We evaluate our approach by selecting parallelism mapping configurations for OpenMP programs on two representative but different multi-core platforms (the Intel Xeon and the Cell processors). The performance of our technique is stable across programs and architectures. On average, it delivers above 96% of the maximum available performance on both platforms. It achieves, on average, a 37% (up to 17.5 times) performance improvement over the OpenMP runtime default scheme on the Cell platform. Compared to two recent prediction models, our predictors achieve better performance at a significantly lower profiling cost.
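The abstract describes an off-line learned model that maps low-cost profiling features of a parallel program to a thread count and an OpenMP scheduling policy. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: the feature set, the training data, and the choice of scikit-learn's MLPRegressor and SVC (loosely echoing the "artificial neural networks" and "support vector machine" author tags) are all assumptions made for illustration only.

# Hypothetical sketch of an off-line learned parallelism-mapping predictor.
# All feature names, training values, and model choices are assumed, not taken
# from the paper.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPRegressor

# Profiling features per training program/data set (values are made up):
# [cycles per instruction, L1 D-cache miss rate, branch miss rate,
#  parallel-region instruction count, loop iteration count]
X_train = np.array([
    [0.8, 0.02, 0.01, 1e6, 1e4],
    [1.5, 0.10, 0.03, 5e7, 2e5],
    [0.6, 0.01, 0.02, 2e6, 8e3],
    [2.1, 0.15, 0.05, 9e7, 5e5],
])

# Off-line training targets: the best thread count and best OpenMP schedule
# found for each training program (invented here; in the paper's setting these
# would come from searching the mapping space on training programs).
best_threads = np.array([4, 8, 2, 8])
best_schedule = np.array(["STATIC", "DYNAMIC", "STATIC", "GUIDED"])

# A regressor predicts the thread number; a classifier predicts the policy.
thread_model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0))
schedule_model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

thread_model.fit(X_train, best_threads)
schedule_model.fit(X_train, best_schedule)

# Deployment: one low-cost profiling run of a new, unseen program yields a
# feature vector; the learned models then predict its mapping.
x_new = np.array([[1.2, 0.07, 0.02, 3e7, 1e5]])
n_threads = int(round(float(thread_model.predict(x_new)[0])))
policy = schedule_model.predict(x_new)[0]
print(f"predicted mapping: {n_threads} threads, schedule({policy})")

The point of the sketch is only the overall flow: train once, off-line, on profiled programs with known best mappings, then predict the mapping for unseen programs from a single cheap profiling run.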



Published In

ACM SIGPLAN Notices, Volume 44, Issue 4 (PPoPP '09), April 2009, 294 pages
ISSN: 0362-1340; EISSN: 1558-1160; DOI: 10.1145/1594835

PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2009, 322 pages
ISBN: 9781605583976; DOI: 10.1145/1504176

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. artificial neural networks
      2. compiler optimization
      3. machine learning
      4. performance modeling
      5. support vector machine

      Qualifiers

      • Research-article


