DOI: 10.1145/2503210.2503277
Using automated performance modeling to find scalability bugs in complex codes

Published: 17 November 2013

Abstract

Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made---a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. In this paper, we show how both coverage and speed of this scalability analysis can be substantially improved. Generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will reduce performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.
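The automated approach the abstract outlines can be pictured as a small search over candidate scaling terms: measure runtime at a few small core counts, fit each candidate model, and keep the best fit. The sketch below is a minimal illustration of that idea under assumptions of our own, not the authors' actual tool; the candidate terms of the form p^i * log2(p)^j, the search ranges, and all function names are invented for the example.

```python
import math

# Measured runtimes at small core counts are fitted to simple candidate
# models of the form t(p) = c0 + c1 * p^i * log2(p)^j; the candidate with
# the smallest squared error is the empirical performance model.

def fit_term(ps, ts, i, j):
    """Closed-form least-squares fit of t = c0 + c1 * p^i * log2(p)^j."""
    xs = [p ** i * math.log2(p) ** j for p in ps]
    n = len(ps)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    c1 = sxt / sxx
    c0 = mean_t - c1 * mean_x
    resid = sum((t - (c0 + c1 * x)) ** 2 for x, t in zip(xs, ts))
    return c0, c1, resid

def best_model(ps, ts):
    """Search a small space of (i, j) exponent pairs; return the best fit."""
    best = None
    for i in range(4):           # polynomial exponent of p
        for j in range(3):       # exponent of log2(p)
            if i == 0 and j == 0:
                continue         # skip the degenerate constant-only term
            c0, c1, resid = fit_term(ps, ts, i, j)
            if best is None or resid < best[0]:
                best = (resid, i, j, c0, c1)
    return best

# Synthetic measurements following t(p) = 10 + 0.001 * p * log2(p):
ps = [64, 128, 256, 512, 1024]
ts = [10 + 0.001 * p * math.log2(p) for p in ps]
resid, i, j, c0, c1 = best_model(ps, ts)
print(f"best model: {c0:.3f} + {c1:.6f} * p^{i} * log2(p)^{j}")
# The dominant p * log2(p) term is the warning sign: this code region
# will consume a growing share of runtime at larger core counts.
```

A real tool would add cross-validation to avoid overfitting the small sample and would generate one such model per call path rather than per program, but the search-and-fit structure is the same.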


Cited By

  • (2025) Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications. IEEE Transactions on Parallel and Distributed Systems, 36(2):308-325, February 2025. DOI: 10.1109/TPDS.2024.3485789
  • (2024) Non-smooth Bayesian optimization in tuning scientific applications. The International Journal of High Performance Computing Applications, 38(6):633-657, September 2024. DOI: 10.1177/10943420241278981
  • (2024) Cost-Efficient Construction of Performance Models. Proc. of the 4th Workshop on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, pages 1-7, June 2024. DOI: 10.1145/3660317.3660322
  • (2024) Are Noise-Resilient Logical Timers Useful for Performance Analysis? SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1519-1530, November 2024. DOI: 10.1109/SCW63240.2024.00192
  • (2024) Software Resource Disaggregation for HPC with Serverless Computing. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 139-156, May 2024. DOI: 10.1109/IPDPS57955.2024.00021
  • (2024) Performance-driven scheduling for malleable workloads. The Journal of Supercomputing, 80(8):11556-11584, January 2024. DOI: 10.1007/s11227-023-05882-0
  • (2024) Efficient Code Region Characterization Through Automatic Performance Counters Reduction Using Machine Learning Techniques. Euro-Par 2024: Parallel Processing, pages 18-32, August 2024. DOI: 10.1007/978-3-031-69577-3_2
  • (2023) Filtering and Ranking of Code Regions for Parallelization via Hotspot Detection and OpenMP Overhead Analysis. Proc. of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pages 1368-1379, November 2023. DOI: 10.1145/3624062.3624206
  • (2023) Towards Collaborative Continuous Benchmarking for HPC. Proc. of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pages 627-635, November 2023. DOI: 10.1145/3624062.3624135
  • (2023) Thicket: Seeing the Performance Experiment Forest for the Individual Run Trees. Proc. of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, pages 281-293, August 2023. DOI: 10.1145/3588195.3592989


Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN:9781450323789
DOI:10.1145/2503210
General Chair: William Gropp
Program Chair: Satoshi Matsuoka

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. performance analysis
  2. performance modeling
  3. scalability
  4. scalasca

Qualifiers

  • Research-article

Conference

SC13

Acceptance Rates

SC '13 Paper Acceptance Rate: 91 of 449 submissions, 20%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%


