DOI: 10.1145/2503210.2503277
Using automated performance modeling to find scalability bugs in complex codes

Published: 17 November 2013

Abstract

Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made---a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. In this paper, we show how both coverage and speed of this scalability analysis can be substantially improved. Generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will reduce performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.
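The automated approach the abstract outlines can be pictured as a small search over candidate scaling terms: measure runtime at a few small core counts, fit each candidate model, and keep the best fit. The sketch below is a minimal illustration of that idea under assumptions of our own, not the authors' actual tool; the candidate terms of the form p^i * log2(p)^j, the search ranges, and all function names are invented for the example.

```python
import math

# Measured runtimes at small core counts are fitted to simple candidate
# models of the form t(p) = c0 + c1 * p^i * log2(p)^j; the candidate with
# the smallest squared error is the empirical performance model.

def fit_term(ps, ts, i, j):
    """Closed-form least-squares fit of t = c0 + c1 * p^i * log2(p)^j."""
    xs = [p ** i * math.log2(p) ** j for p in ps]
    n = len(ps)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    c1 = sxt / sxx
    c0 = mean_t - c1 * mean_x
    resid = sum((t - (c0 + c1 * x)) ** 2 for x, t in zip(xs, ts))
    return c0, c1, resid

def best_model(ps, ts):
    """Search a small space of (i, j) exponent pairs; return the best fit."""
    best = None
    for i in range(4):           # polynomial exponent of p
        for j in range(3):       # exponent of log2(p)
            if i == 0 and j == 0:
                continue         # skip the degenerate constant-only term
            c0, c1, resid = fit_term(ps, ts, i, j)
            if best is None or resid < best[0]:
                best = (resid, i, j, c0, c1)
    return best

# Synthetic measurements following t(p) = 10 + 0.001 * p * log2(p):
ps = [64, 128, 256, 512, 1024]
ts = [10 + 0.001 * p * math.log2(p) for p in ps]
resid, i, j, c0, c1 = best_model(ps, ts)
print(f"best model: {c0:.3f} + {c1:.6f} * p^{i} * log2(p)^{j}")
# The dominant p * log2(p) term is the warning sign: this code region
# will consume a growing share of runtime at larger core counts.
```

A real tool would add cross-validation to avoid overfitting the small sample and would generate one such model per call path rather than per program, but the search-and-fit structure is the same.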


Cited By

  • (2025) Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications. IEEE Transactions on Parallel and Distributed Systems, 36(2):308-325, February 2025. DOI: 10.1109/TPDS.2024.3485789
  • (2024) Non-smooth Bayesian optimization in tuning scientific applications. The International Journal of High Performance Computing Applications, 38(6):633-657, September 2024. DOI: 10.1177/10943420241278981
  • (2024) Cost-Efficient Construction of Performance Models. Proc. of the 4th Workshop on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, pages 1-7, June 2024. DOI: 10.1145/3660317.3660322
  • (2024) Are Noise-Resilient Logical Timers Useful for Performance Analysis? SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1519-1530, November 2024. DOI: 10.1109/SCW63240.2024.00192
  • (2024) Software Resource Disaggregation for HPC with Serverless Computing. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 139-156, May 2024. DOI: 10.1109/IPDPS57955.2024.00021
  • (2024) Performance-driven scheduling for malleable workloads. The Journal of Supercomputing, 80(8):11556-11584, January 2024. DOI: 10.1007/s11227-023-05882-0
  • (2024) Efficient Code Region Characterization Through Automatic Performance Counters Reduction Using Machine Learning Techniques. Euro-Par 2024: Parallel Processing, pages 18-32, August 2024. DOI: 10.1007/978-3-031-69577-3_2
  • (2023) Filtering and Ranking of Code Regions for Parallelization via Hotspot Detection and OpenMP Overhead Analysis. Proc. of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pages 1368-1379, November 2023. DOI: 10.1145/3624062.3624206
  • (2023) Towards Collaborative Continuous Benchmarking for HPC. Proc. of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pages 627-635, November 2023. DOI: 10.1145/3624062.3624135
  • (2023) Thicket: Seeing the Performance Experiment Forest for the Individual Run Trees. Proc. of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, pages 281-293, August 2023. DOI: 10.1145/3588195.3592989


Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN:9781450323789
DOI:10.1145/2503210
General Chair: William Gropp
Program Chair: Satoshi Matsuoka

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. performance analysis
  2. performance modeling
  3. scalability
  4. scalasca

Qualifiers

  • Research-article

Conference

SC13

Acceptance Rates

SC '13 Paper Acceptance Rate: 91 of 449 submissions, 20%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%


