DOI: 10.1145/3623278.3624770
Research Article · Open Access

BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Published: 07 February 2024

Abstract

We introduce the Bayesian Compiler Optimization framework (BaCO), a general-purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO provides the flexibility needed to handle the requirements of modern autotuning tasks. In particular, it handles permutation, ordered, and continuous parameter types, along with both known and unknown parameter constraints. To reason about these parameter types and efficiently deliver high-quality code, BaCO uses Bayesian optimization algorithms specialized to the autotuning domain. We demonstrate BaCO's effectiveness on three modern compiler systems: TACO, RISE & ELEVATE, and HPVM2FPGA, targeting CPUs, GPUs, and FPGAs respectively. Across these domains, BaCO outperforms current state-of-the-art autotuners, delivering on average 1.36x--1.56x faster code with a small search budget and reaching expert-level performance 2.9x--3.9x faster.
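The loop the abstract describes (a surrogate model proposing configurations from a mixed space of permutation, ordered, and continuous parameters under feasibility constraints) can be sketched in a few lines. The sketch below is purely illustrative and is not BaCO's actual API: the tile-size, loop-order, and unroll-factor parameters, the feasibility constraint, and the measure_runtime stand-in for compile-and-measure are all hypothetical, and scikit-learn's Gaussian process regressor stands in for BaCO's specialized surrogate.

```python
# Minimal sketch of Bayesian-optimization-style autotuning over a mixed
# search space. Purely illustrative; not BaCO's API. All parameter names
# and the objective are hypothetical stand-ins.
import itertools
import random

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

TILE_SIZES = [8, 16, 32, 64]                       # ordered parameter
LOOP_ORDERS = list(itertools.permutations("ijk"))  # permutation parameter


def sample_config():
    return {
        "tile": random.choice(TILE_SIZES),
        "order": random.choice(LOOP_ORDERS),
        "unroll": random.uniform(1.0, 8.0),        # continuous parameter
    }


def feasible(cfg):
    # Example of a *known* constraint: large tiles require the "ijk" order.
    # Unknown constraints would instead surface as failed compilations/runs.
    return cfg["tile"] <= 32 or cfg["order"] == ("i", "j", "k")


def encode(cfg):
    # Naive numeric encoding for the surrogate. Per the paper's framing,
    # BaCO instead reasons about permutation/ordered types directly.
    return [cfg["tile"], LOOP_ORDERS.index(cfg["order"]), cfg["unroll"]]


def measure_runtime(cfg):
    # Hypothetical objective: a real autotuner would compile and time the
    # candidate schedule here.
    return (abs(cfg["tile"] - 32)
            + LOOP_ORDERS.index(cfg["order"])
            + (cfg["unroll"] - 4.0) ** 2)


# Bootstrap the surrogate with a few random feasible configurations.
history = []
while len(history) < 5:
    cfg = sample_config()
    if feasible(cfg):
        history.append((cfg, measure_runtime(cfg)))

for _ in range(20):                                # small search budget
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(np.array([encode(c) for c, _ in history]),
           np.array([t for _, t in history]))

    # Acquisition: lower confidence bound over a batch of feasible samples.
    cands = [c for c in (sample_config() for _ in range(200)) if feasible(c)]
    mu, sigma = gp.predict(np.array([encode(c) for c in cands]),
                           return_std=True)
    best = cands[int(np.argmin(mu - 1.96 * sigma))]
    history.append((best, measure_runtime(best)))

best_cfg, best_time = min(history, key=lambda ct: ct[1])
print("best configuration:", best_cfg, "modeled runtime:", best_time)
```

In a real setting, measure_runtime would invoke the compiler and benchmark the generated code; the abstract's point is that the Bayesian machinery itself must be specialized to these parameter types, rather than relying on a flat numeric encoding like encode above.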


Cited By

• (2024) Compilation of Modular and General Sparse Workspaces. Proceedings of the ACM on Programming Languages, 8(PLDI), 1213--1238. DOI: 10.1145/3656426

    Published In

    ASPLOS '23: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4
    March 2023
    430 pages
ISBN: 9798400703942
DOI: 10.1145/3623278
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. compiler optimizations
    2. high-performance computing
    3. bayesian optimization
    4. autotuning
    5. autoscheduling

    Conference

    ASPLOS '23

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%
