research-article
Open access

On Using the Roofline Model with Lower Bounds on Data Movement

Published: 09 January 2015

Abstract

The roofline model is a popular approach for “bound and bottleneck” performance analysis. It focuses on the limits that finite bandwidth to off-chip memory places on processor performance, and it models upper bounds on performance as a function of operational intensity: the ratio of computational operations performed to bytes of data moved to or from memory. While operational intensity can be measured directly for a specific implementation of an algorithm on a particular target platform, it is of interest to obtain broader insights on bottlenecks by considering the various semantically equivalent implementations of an algorithm, along with variations in architectural parameters. Doing so is currently very cumbersome, because it requires performance modeling and analysis of many variants.
In this article, we address this problem by using the roofline model in conjunction with upper bounds on the operational intensity of computations as a function of cache capacity, derived from lower bounds on data movement. This enables bottleneck analysis that holds across all dependence-preserving semantically equivalent implementations of an algorithm. We demonstrate the utility of the approach in assessing fundamental limits to performance and energy efficiency for several benchmark algorithms across a design space of architectural variations.
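As a rough illustration of the idea in the abstract (not the authors' code), the sketch below combines the standard roofline bound, min(peak compute, bandwidth × operational intensity), with an upper bound on operational intensity obtained by dividing an algorithm's operation count by a lower bound on its data movement. The function names, the machine parameters, and the constant `c` are assumptions made for this example. The matrix-multiplication bound uses only the well-known asymptotic form n³/√S for the classical algorithm with a cache of S words; the exact constant is left as a hypothetical parameter.

```python
import math


def attainable_gflops(peak_gflops, bandwidth_gbs, oi_flops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * oi_flops_per_byte)


def oi_upper_bound(total_flops, min_bytes_moved):
    """Upper bound on operational intensity (flops per byte).

    Any implementation must move at least `min_bytes_moved` bytes, so no
    implementation can exceed this ratio.
    """
    return total_flops / min_bytes_moved


def matmul_min_words(n, cache_words, c=1.0):
    """Illustrative data-movement lower bound for classical n x n matmul.

    Uses the asymptotic form c * n^3 / sqrt(S); the constant c is an
    assumption for this sketch, not a value taken from the article.
    """
    return c * n ** 3 / math.sqrt(cache_words)


if __name__ == "__main__":
    n = 1024
    cache_words = 4096          # hypothetical cache capacity in 8-byte words
    word_bytes = 8

    flops = 2 * n ** 3          # one multiply and one add per inner iteration
    min_bytes = matmul_min_words(n, cache_words) * word_bytes
    oi_max = oi_upper_bound(flops, min_bytes)
    print(oi_max)               # 16.0 flops/byte

    # Hypothetical machine: 100 GFLOP/s peak, 5 GB/s memory bandwidth.
    print(attainable_gflops(100.0, 5.0, oi_max))    # 80.0 (bandwidth-bound)
    # Doubling the bandwidth moves the same algorithm under the compute roof.
    print(attainable_gflops(100.0, 10.0, oi_max))   # 100.0 (compute-bound)
```

Because `oi_max` depends only on the algorithm and the cache capacity, the resulting performance ceiling applies to every implementation covered by the data-movement lower bound, which is what allows a sweep over cache sizes and bandwidths without modeling each variant separately.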


Published In

ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4
January 2015
797 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2695583
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2015
Accepted: 01 November 2014
Revised: 01 November 2014
Received: 01 June 2014
Published in TACO Volume 11, Issue 4

Author Tags

  1. I/O lower bounds
  2. Operational intensity upper bounds
  3. algorithm-architecture codesign
  4. architecture design space exploration

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 107
  • Downloads (Last 6 weeks): 10
Reflects downloads up to 22 Feb 2025

Cited By

  • (2022) A High-Fidelity Flow Solver for Unstructured Meshes on Field-Programmable Gate Arrays: Design, Evaluation, and Future Challenges. International Conference on High Performance Computing in Asia-Pacific Region, 125-136. DOI: 10.1145/3492805.3492808. Online publication date: 7-Jan-2022
  • (2017) Verification of the Extended Roofline Model for Asynchronous Many Task Runtimes. Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware, 1-8. DOI: 10.1145/3152041.3152087. Online publication date: 12-Nov-2017
  • (2017) Beyond the Roofline. IEEE Transactions on Computers 66:1, 52-58. DOI: 10.1109/TC.2016.2582151. Online publication date: 1-Jan-2017
  • (2016) FFT on XMT: Case Study of a Bandwidth-Intensive Regular Algorithm on a Highly-Parallel Many Core. 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 561-569. DOI: 10.1109/IPDPSW.2016.157. Online publication date: May-2016
