
Architecture-Adaptive Code Variant Tuning

Published: 25 March 2016

Abstract

Code variants represent alternative implementations of a computation, and are common in high-performance libraries and applications to facilitate selecting the most appropriate implementation for a specific execution context (target architecture and input dataset). Automating code variant selection typically relies on machine learning to construct a model during an offline learning phase that can be quickly queried at runtime once the execution context is known. In this paper, we define a new approach called architecture-adaptive code variant tuning, where the variant selection model is learned on a set of source architectures, and then used to predict variants on a new target architecture without having to repeat the training process. We pose this as a multi-task learning problem, where each source architecture corresponds to a task; we use device features in the construction of the variant selection model. This work explores the effectiveness of multi-task learning and the impact of different strategies for device feature selection. We evaluate our approach on a set of benchmarks and a collection of six NVIDIA GPU architectures from three distinct generations. We achieve performance results that are mostly comparable to the previous approach of tuning for a single GPU architecture without having to repeat the learning phase.
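
To make the pooled, feature-based formulation described above concrete, the sketch below trains a single model over rows of the form [device features || input features] collected on several source GPUs, then queries it for an unseen target GPU. This is only a minimal illustration of the idea as stated in the abstract: the random-forest learner, the feature names (SM count, memory bandwidth, L2 size, input size, density), and all numeric values are hypothetical stand-ins, not the paper's actual model, features, or data.

```python
# Minimal sketch: pool (device features + input features) -> best-variant
# labels from several source GPUs, train one model, then query it for a new
# target GPU. All names, features, and values here are illustrative
# assumptions, not the paper's actual setup.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical device features per source architecture ("task"):
# [SM count, memory bandwidth (GB/s), L2 cache (KB)]
source_gpus = {
    "gpu_a": [14, 208, 1536],
    "gpu_b": [15, 288, 1536],
    "gpu_c": [24, 336, 3072],
}

# Hypothetical offline profiling results per source GPU:
# (input features [size, density], index of the fastest code variant)
training_data = {
    "gpu_a": [([1e5, 0.01], 0), ([1e7, 0.30], 1)],
    "gpu_b": [([1e5, 0.01], 0), ([1e7, 0.30], 2)],
    "gpu_c": [([1e5, 0.01], 1), ([1e7, 0.30], 2)],
}

# Pool all source architectures into one training set; each row is
# [device features || input features], so device characteristics become
# ordinary model inputs rather than one model per architecture.
X, y = [], []
for gpu, samples in training_data.items():
    for input_feats, best_variant in samples:
        X.append(source_gpus[gpu] + input_feats)
        y.append(best_variant)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(np.array(X), np.array(y))

# Runtime query on an unseen target GPU: no retraining, just supply the
# target's device features alongside the current input's features.
target_gpu = [28, 480, 4096]   # hypothetical new architecture
current_input = [5e6, 0.10]
print("selected variant:", model.predict([target_gpu + current_input])[0])
```

Because the target architecture's characteristics enter the model as ordinary inputs, prediction on a new GPU is a single query rather than a repeated offline training phase, which is the property the abstract refers to as architecture-adaptive tuning.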




Published In

ACM SIGARCH Computer Architecture News, Volume 44, Issue 2
ASPLOS'16
May 2016
774 pages
ISSN:0163-5964
DOI:10.1145/2980024
  • ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2016
    824 pages
    ISBN: 9781450340915
    DOI: 10.1145/2872362
    General Chair: Tom Conte; Program Chair: Yuanyuan Zhou

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 March 2016
Published in SIGARCH Volume 44, Issue 2

Author Tags

  1. autotuning
  2. cross-architectural tuning
  3. device feature selection
  4. input-adaptive
  5. multi-task learning

Qualifiers

  • Research-article

