On Using the Roofline Model with Lower Bounds on Data Movement

Authors: Venmugil Elango, Naser Sedaghati, Fabrice Rastello, Louis-Noël Pouchet, Radu Teodorescu, P. Sadayappan

ACM Transactions on Architecture and Code Optimization (TACO), Volume 11, Issue 4

Article No.: 67, Pages 1 - 23

https://doi.org/10.1145/2693656

Published: 09 January 2015

Abstract

The roofline model is a popular approach for “bound and bottleneck” performance analysis. It focuses on the limits that finite bandwidth to off-chip memory places on processor performance, and models upper bounds on performance as a function of operational intensity: the ratio of computational operations performed to bytes of data moved from/to memory. While operational intensity can be directly measured for a specific implementation of an algorithm on a particular target platform, it is of interest to obtain broader insights on bottlenecks, where various semantically equivalent implementations of an algorithm are considered, along with analysis for variations in architectural parameters. Doing so is currently very cumbersome, because it requires performance modeling and analysis of many variants.
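
As a rough illustration of the bound described above, the following minimal sketch computes the roofline limit as the smaller of a compute roof and a memory roof. The function names and the machine numbers (100 GFLOP/s peak, 25 GB/s bandwidth) are hypothetical values chosen for the example, not figures from the article.

```python
def attainable_gflops(peak_gflops: float,
                      bandwidth_gbs: float,
                      oi_flops_per_byte: float) -> float:
    """Roofline upper bound on performance (GFLOP/s).

    The compute roof is the peak rate; the memory roof is
    bandwidth times operational intensity. The bound is their minimum.
    """
    if oi_flops_per_byte < 0:
        raise ValueError("operational intensity must be non-negative")
    return min(peak_gflops, bandwidth_gbs * oi_flops_per_byte)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Operational intensity at which the two roofs meet (FLOP/byte)."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    PEAK = 100.0  # hypothetical peak compute rate, GFLOP/s
    BW = 25.0     # hypothetical off-chip bandwidth, GB/s
    for oi in (0.5, 4.0, 16.0):
        bound = attainable_gflops(PEAK, BW, oi)
        regime = "memory-bound" if oi < ridge_point(PEAK, BW) else "compute-bound"
        print(f"OI={oi:5.1f} FLOP/byte -> at most {bound:6.1f} GFLOP/s ({regime})")
```

Under these assumed numbers, any code with operational intensity below the ridge point of 4 FLOP/byte is limited by memory bandwidth rather than by the peak compute rate.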

In this article, we address this problem by using the roofline model in conjunction with upper bounds on the operational intensity of computations as a function of cache capacity, derived from lower bounds on data movement. This enables bottleneck analysis that holds across all dependence-preserving semantically equivalent implementations of an algorithm. We demonstrate the utility of the approach in assessing fundamental limits to performance and energy efficiency for several benchmark algorithms across a design space of architectural variations.
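
The idea in the paragraph above can be sketched as follows: a lower bound on data movement for a given cache capacity yields an upper bound on operational intensity, which in turn yields a performance bound that no dependence-preserving reordering can exceed. This is only a sketch under stated assumptions. The `toy_lower_bound` function, the operation count, and the machine parameters are all hypothetical placeholders, not bounds or measurements from the article.

```python
import math
from typing import Callable


def oi_upper_bound(num_ops: float, min_bytes_moved: float) -> float:
    """Upper bound on operational intensity (ops per byte).

    If every valid schedule must move at least `min_bytes_moved` bytes,
    no schedule can exceed num_ops / min_bytes_moved ops per byte.
    """
    if min_bytes_moved <= 0:
        raise ValueError("data-movement lower bound must be positive")
    return num_ops / min_bytes_moved


def performance_upper_bound(peak_gflops: float,
                            bandwidth_gbs: float,
                            num_ops: float,
                            data_movement_lower_bound: Callable[[float], float],
                            cache_bytes: float) -> float:
    """Roofline bound evaluated at the OI upper bound for a given cache size."""
    q_min = data_movement_lower_bound(cache_bytes)
    oi_max = oi_upper_bound(num_ops, q_min)
    return min(peak_gflops, bandwidth_gbs * oi_max)


def toy_lower_bound(cache_bytes: float) -> float:
    """Hypothetical lower bound (bytes) that shrinks as the cache grows.

    Purely illustrative; a real analysis would plug in a bound derived
    for the specific algorithm.
    """
    return 8e9 / math.sqrt(cache_bytes)


if __name__ == "__main__":
    PEAK, BW, OPS = 100.0, 25.0, 1e9  # hypothetical machine and op count
    for cache in (64.0, 1024.0 * 1024.0):
        bound = performance_upper_bound(PEAK, BW, OPS, toy_lower_bound, cache)
        print(f"cache={cache:>10.0f} B -> at most {bound:.1f} GFLOP/s")
```

Sweeping `cache_bytes`, `bandwidth_gbs`, or `peak_gflops` in such a function is one way to explore a design space of architectural variations without modeling each implementation variant separately.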

Published In

ACM Transactions on Architecture and Code Optimization Volume 11, Issue 4

January 2015

797 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2695583

Editor:
Koen De Bosschere
Ghent University

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2015

Accepted: 01 November 2014

Revised: 01 November 2014

Received: 01 June 2014

Qualifiers

Research-article
Research
Refereed

Funding Sources

U.S. Department of Energy
U.S. National Science Foundation
