
Architecture-Adaptive Code Variant Tuning

Published: 25 March 2016

Abstract

Code variants represent alternative implementations of a computation, and are common in high-performance libraries and applications to facilitate selecting the most appropriate implementation for a specific execution context (target architecture and input dataset). Automating code variant selection typically relies on machine learning to construct a model during an offline learning phase that can be quickly queried at runtime once the execution context is known. In this paper, we define a new approach called architecture-adaptive code variant tuning, where the variant selection model is learned on a set of source architectures, and then used to predict variants on a new target architecture without having to repeat the training process. We pose this as a multi-task learning problem, where each source architecture corresponds to a task; we use device features in the construction of the variant selection model. This work explores the effectiveness of multi-task learning and the impact of different strategies for device feature selection. We evaluate our approach on a set of benchmarks and a collection of six NVIDIA GPU architectures from three distinct generations. We achieve performance results that are mostly comparable to the previous approach of tuning for a single GPU architecture without having to repeat the learning phase.
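
To make the pooled, feature-based formulation described above concrete, the sketch below trains a single model over rows of the form [device features || input features] collected on several source GPUs, then queries it for an unseen target GPU. This is only a minimal illustration of the idea as stated in the abstract: the random-forest learner, the feature names (SM count, memory bandwidth, L2 size, input size, density), and all numeric values are hypothetical stand-ins, not the paper's actual model, features, or data.

```python
# Minimal sketch: pool (device features + input features) -> best-variant
# labels from several source GPUs, train one model, then query it for a new
# target GPU. All names, features, and values here are illustrative
# assumptions, not the paper's actual setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical device features per source architecture ("task"):
# [SM count, memory bandwidth (GB/s), L2 cache (KB)]
source_gpus = {
    "gpu_a": [14, 208, 1536],
    "gpu_b": [15, 288, 1536],
    "gpu_c": [24, 336, 3072],
}

# Hypothetical offline profiling results per source GPU:
# (input features [size, density], index of the fastest code variant)
training_data = {
    "gpu_a": [([1e5, 0.01], 0), ([1e7, 0.30], 1)],
    "gpu_b": [([1e5, 0.01], 0), ([1e7, 0.30], 2)],
    "gpu_c": [([1e5, 0.01], 1), ([1e7, 0.30], 2)],
}

# Pool all source architectures into one training set; each row is
# [device features || input features], so device characteristics become
# ordinary model inputs rather than one model per architecture.
X, y = [], []
for gpu, samples in training_data.items():
    for input_feats, best_variant in samples:
        X.append(source_gpus[gpu] + input_feats)
        y.append(best_variant)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(np.array(X), np.array(y))

# Runtime query on an unseen target GPU: no retraining, just supply the
# target's device features alongside the current input's features.
target_gpu = [28, 480, 4096]   # hypothetical new architecture
current_input = [5e6, 0.10]
print("selected variant:", model.predict([target_gpu + current_input])[0])
```

Because the target architecture's characteristics enter the model as ordinary inputs, prediction on a new GPU is a single query rather than a repeated offline training phase, which is the property the abstract refers to as architecture-adaptive tuning.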




Published In

ACM SIGARCH Computer Architecture News, Volume 44, Issue 2
ASPLOS'16
May 2016
774 pages
ISSN:0163-5964
DOI:10.1145/2980024
  • ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2016
    824 pages
    ISBN: 9781450340915
    DOI: 10.1145/2872362
    General Chair: Tom Conte; Program Chair: Yuanyuan Zhou

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 March 2016
Published in SIGARCH Volume 44, Issue 2

Author Tags

  1. autotuning
  2. cross-architectural tuning
  3. device feature selection
  4. input-adaptive
  5. multi-task learning

Qualifiers

  • Research-article

